Kaggle describe this competition as follows:
People interested in renting an apartment or home, share information about themselves and their property on Airbnb. Those who end up renting the property share their experiences through reviews. The dataset describes property, host, and reviews for over 40,000 Airbnb rentals in New York along 90 variables.*
Loading R packages used besides base R.
library(knitr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.4 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(corrplot)
## corrplot 0.90 loaded
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(rapport)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(cattonum)
## cattonum is seeking a new maintainer; please respond if interested: https://github.com/bfgray3/cattonum/issues/40
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(skimr)
library(vtreat)
## Loading required package: wrapr
##
## Attaching package: 'wrapr'
## The following object is masked from 'package:dplyr':
##
## coalesce
## The following objects are masked from 'package:tidyr':
##
## pack, unpack
## The following object is masked from 'package:tibble':
##
## view
Reading two csv files as dataframes into R.
data = read.csv("/Users/chensixian/Desktop/CU AA/5200/Kaggle/rentlala2021/analysisData.csv")
ScoringData = read.csv("/Users/chensixian/Desktop/CU AA/5200/Kaggle/rentlala2021/ScoringData.csv")
All variables should be carefully examined, understand what they means could help a lot in data cleaning and feature engineering.
str(data)
## 'data.frame': 41330 obs. of 91 variables:
## $ id : int 914965 262763 279053 385319 476558 522365 803935 373951 729674 372028 ...
## $ name : chr "<BRAND NEW BUILDING! 2 BR 1. 5 BA in Trendy Bushwick>" "<Brooklyn Brownstone parlor living with full A/C>" "<Posh one bedroom apartment in Brooklyn>" "<COMFY ROOM in Chelsea for long term>" ...
## $ summary : chr "Trendy, artsy buswick. Go where the NYC artists are! Cafe's, bars, and the subway to downtown is a 4 minu"| __truncated__ "Full parlor floor in a historic Brooklyn Brownstone in Prospect Heights, we are within blocks of Prospect Park"| __truncated__ "Brand new super clean and glamorous one bedroom garden apartment with central air/heat in the heart of trendy P"| __truncated__ "*Beautiful room available for long term (prefer >30days)* - Perfect for 1 person, possible but not recommende"| __truncated__ ...
## $ space : chr "There is a full bath and master suite bedroom with Queen bed on the first floor. Living room with dining area "| __truncated__ "" "The apartment has its own entry/exit, is impeccably clean and is equipped with all basic necessities including"| __truncated__ "The kitchen is spacious and have enough equipments for you to cook any basic dishes! There’s a dining table wi"| __truncated__ ...
## $ description : chr "Trendy, artsy buswick. Go where the NYC artists are! Cafe's, bars, and the subway to downtown is a 4 minu"| __truncated__ "Full parlor floor in a historic Brooklyn Brownstone in Prospect Heights, we are within blocks of Prospect Park"| __truncated__ "Brand new super clean and glamorous one bedroom garden apartment with central air/heat in the heart of trendy P"| __truncated__ "*Beautiful room available for long term (prefer >30days)* - Perfect for 1 person, possible but not recommende"| __truncated__ ...
## $ neighborhood_overview : chr "" "" "Short and easy walk to: - Grand army plaza - Prospect Park - Brooklyn botanic gardens - Brooklyn museum - Atlan"| __truncated__ "There are many very good restaurants in the neighborhood. Chipotle, Pizza and very good ramen and sushi are j"| __truncated__ ...
## $ notes : chr "" "" "While the household and building is rather academic and we socialize a fair bit, we stay away from loud partie"| __truncated__ "I have a secret back yard so if you’re an awesome guest I will invite you out there for tea or wine! :)" ...
## $ transit : chr "The J/Z subway is a 4 minute walk and the L train is a 7 minute walk. 25 minutes into downtown Manhattan!" "" "Several subway stops are within a short walk; the closest line is 2 blocks away. " "Parking is available for free right in the front after 7pm, and 1 hour meter is from 7:30am to 7 pm. Across "| __truncated__ ...
## $ access : chr "" "There will be a key lockbox on the front gate. Walk up the brownstown front stairs and open the door by turnin"| __truncated__ "Access to a full size washer and dryer (detergent provided)" "Bathroom and living room" ...
## $ interaction : chr "I live locally, so I am available. " "You will have access and use of 2 bedrooms which includes a queen sized bed, air mattress, and extra-long couch. " "We are a low key couple and live in the building and available as much or as little as needed. " "I’m available anytime if you have any questions! I would love to get to know you and hear your stories but if y"| __truncated__ ...
## $ house_rules : chr "" "No parties! We expect any guest to be added to the reservation in advance. Please respect this home as your ow"| __truncated__ "Smoking and vaping (e-cigarettes) anywhere on the premises is STRICTLY PROHIBITED; Overnight visitors NOT allo"| __truncated__ "Please always clean up after yourself when you use the bathroom or kitchen, take your own trash out and keep t"| __truncated__ ...
## $ host_name : chr "Andrea" "Jeremy" "Foster" "Sophie" ...
## $ host_since : chr "2019-06-13" "2011-06-08" "2013-01-20" "2014-08-18" ...
## $ host_location : chr "US" "New York, New York, United States" "New York, New York, United States" "Ossining, New York, United States" ...
## $ host_about : chr "" "Jeremy and Kristy live in Brooklyn and East Hampton NY. Jeremy is in technology/finance and Kristy is in archit"| __truncated__ "" "" ...
## $ host_response_time : chr "within a few hours" "within an hour" "within a day" "within an hour" ...
## $ host_response_rate : chr "70%" "86%" "100%" "100%" ...
## $ host_acceptance_rate : chr "N/A" "25%" "66%" "98%" ...
## $ host_is_superhost : chr "f" "f" "t" "f" ...
## $ host_neighbourhood : chr "Bushwick" "" "Prospect Heights" "Greenwich Village" ...
## $ host_listings_count : int 0 1 1 3 6 1 1 221 1 2 ...
## $ host_total_listings_count : int 0 1 1 3 6 1 1 221 1 2 ...
## $ host_verifications : chr "['phone']" "['email', 'phone', 'facebook', 'reviews', 'jumio', 'government_id', 'work_email']" "['email', 'phone', 'reviews', 'jumio', 'government_id']" "['email', 'phone', 'reviews', 'jumio', 'government_id']" ...
## $ host_has_profile_pic : chr "t" "t" "t" "t" ...
## $ host_identity_verified : chr "f" "t" "f" "t" ...
## $ street : chr "Brooklyn, NY, United States" "New York, NY, United States" "Brooklyn, NY, United States" "New York, NY, United States" ...
## $ neighbourhood : chr "Brooklyn" "Brooklyn" "Prospect Heights" "Manhattan" ...
## $ neighbourhood_cleansed : chr "Bushwick" "Prospect Heights" "Prospect Heights" "Greenwich Village" ...
## $ neighbourhood_group_cleansed : chr "Brooklyn" "Brooklyn" "Brooklyn" "Manhattan" ...
## $ city : chr "Brooklyn" "New York" "Brooklyn" "New York" ...
## $ state : chr "NY" "NY" "NY" "NY" ...
## $ zipcode : chr "11207" "11238" "11238" "10018" ...
## $ market : chr "New York" "New York" "New York" "New York" ...
## $ smart_location : chr "Brooklyn, NY" "New York, NY" "Brooklyn, NY" "New York, NY" ...
## $ country_code : chr "US" "US" "US" "US" ...
## $ country : chr "United States" "United States" "United States" "United States" ...
## $ is_location_exact : chr "t" "f" "t" "t" ...
## $ property_type : chr "Condominium" "Apartment" "Guest suite" "Apartment" ...
## $ room_type : chr "Entire home/apt" "Entire home/apt" "Entire home/apt" "Private room" ...
## $ accommodates : int 5 4 3 2 2 2 1 6 2 8 ...
## $ bathrooms : num 1.5 1 1 1 1.5 1 1 2 1 1.5 ...
## $ bedrooms : int 2 2 1 1 1 1 1 2 1 3 ...
## $ beds : int 2 1 1 1 1 1 1 3 1 4 ...
## $ bed_type : chr "Real Bed" "Real Bed" "Real Bed" "Real Bed" ...
## $ amenities : chr "TV,Wifi, Air conditioning ,Kitchen,Heating,Washer,Dryer, Smoke detector , Carbon monoxide detector , Fire extin"| __truncated__ "TV,Internet,Wifi,Kitchen, Pets live on this property ,Dog(s), Free street parking , Buzzer/wireless intercom ,H"| __truncated__ "TV,Wifi, Air conditioning ,Kitchen,Heating,Washer,Dryer, Smoke detector , Carbon monoxide detector , First aid "| __truncated__ "Wifi, Air conditioning ,Kitchen, Buzzer/wireless intercom ,Heating, Smoke detector , Carbon monoxide detector ,"| __truncated__ ...
## $ square_feet : int NA NA NA NA NA NA NA NA NA NA ...
## $ price : int 145 199 165 59 48 80 58 339 65 119 ...
## $ weekly_price : int NA NA NA NA NA NA NA NA 650 NA ...
## $ monthly_price : int NA NA NA NA NA NA NA NA NA NA ...
## $ security_deposit : int 0 399 0 0 0 NA NA 0 0 100 ...
## $ cleaning_fee : int 110 50 75 49 25 45 NA 99 50 95 ...
## $ guests_included : int 1 2 2 1 1 1 1 1 1 7 ...
## $ extra_people : int 0 50 50 25 15 0 0 0 10 25 ...
## $ minimum_nights : int 6 2 2 30 2 2 3 3 5 3 ...
## $ maximum_nights : int 29 1125 12 1125 1125 30 31 1125 32 3 ...
## $ minimum_minimum_nights : int 6 2 1 30 2 2 3 3 5 3 ...
## $ maximum_minimum_nights : int 6 2 3 30 2 2 3 3 5 3 ...
## $ minimum_maximum_nights : int 29 1125 12 1125 1125 30 31 1125 32 3 ...
## $ maximum_maximum_nights : int 29 1125 12 1125 1125 30 31 1125 32 3 ...
## $ minimum_nights_avg_ntm : num 6 2 2 30 2 2 3 3 5 3 ...
## $ maximum_nights_avg_ntm : num 29 1125 12 1125 1125 ...
## $ calendar_updated : chr "2 weeks ago" "3 weeks ago" "yesterday" "4 weeks ago" ...
## $ has_availability : chr "t" "t" "t" "t" ...
## $ availability_30 : int 0 8 17 11 0 5 0 0 0 22 ...
## $ availability_60 : int 0 38 28 11 0 35 0 0 0 33 ...
## $ availability_90 : int 0 68 48 11 12 65 0 0 0 50 ...
## $ availability_365 : int 18 158 112 35 73 65 0 48 66 130 ...
## $ number_of_reviews : int 2 5 52 30 14 20 1 28 6 23 ...
## $ number_of_reviews_ltm : int 2 4 45 29 14 20 0 1 0 10 ...
## $ first_review : chr "2019-07-15" "2017-04-11" "2019-01-21" "2018-05-19" ...
## $ last_review : chr "2019-07-20" "2019-12-31" "2020-03-01" "2020-01-01" ...
## $ review_scores_rating : int 100 92 100 94 99 100 100 99 96 93 ...
## $ review_scores_accuracy : int 10 10 10 9 10 10 10 10 10 9 ...
## $ review_scores_cleanliness : int 10 10 10 9 9 10 10 10 10 9 ...
## $ review_scores_checkin : int 10 10 10 9 10 10 10 10 10 9 ...
## $ review_scores_communication : int 10 10 10 10 10 10 10 10 10 9 ...
## $ review_scores_location : int 10 10 10 10 10 10 10 10 10 9 ...
## $ review_scores_value : int 10 8 10 9 10 10 10 9 10 9 ...
## $ requires_license : chr "f" "f" "f" "f" ...
## $ license : chr "" "" "" "" ...
## $ jurisdiction_names : chr "" "" "" "" ...
## $ instant_bookable : chr "f" "f" "f" "f" ...
## $ is_business_travel_ready : chr "f" "f" "f" "f" ...
## $ cancellation_policy : chr "strict_14_with_grace_period" "strict_14_with_grace_period" "strict_14_with_grace_period" "strict_14_with_grace_period" ...
## $ require_guest_profile_picture : chr "f" "f" "f" "f" ...
## $ require_guest_phone_verification : chr "f" "f" "f" "f" ...
## $ calculated_host_listings_count : int 1 1 1 2 3 1 1 12 2 2 ...
## $ calculated_host_listings_count_entire_homes : int 1 1 1 0 0 0 0 12 1 2 ...
## $ calculated_host_listings_count_private_rooms: int 0 0 0 2 3 1 1 0 1 0 ...
## $ calculated_host_listings_count_shared_rooms : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reviews_per_month : num 2 0.14 3.72 1.35 1.34 2.33 0.04 1.22 0.07 0.98 ...
dim(data)
## [1] 41330 91
We can see “Data” has 41330 observations and 91 features.
As the plot shows, the airbnb rent price are right skewed. This means only small proportion of houses are expensive.
## Target Variable Exploration
ggplot(data, aes(price)) +
geom_histogram(fill = "red",bins = 100) +
labs(title ="Price Distribution") +
scale_x_continuous(breaks= seq(0, 1050, by=50))
summary(data$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 66 100 137 170 999
I want to see the correlation within all numeric variables
numericVars <- which(sapply(data, is.numeric)) #index vector numeric variables
numericVarNames <- names(numericVars) #saving names vector for use later on
cat('There are', length(numericVars), 'numeric variables')
## There are 41 numeric variables
all_numVar <- data[, numericVars]
cor_numVar <- cor(all_numVar, use="pairwise.complete.obs") #correlations of all numeric variables
#sort on decreasing correlations with price
cor_sorted <- as.matrix(sort(cor_numVar[,'price'], decreasing = TRUE))
#select corelations > 0.2
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.2)))
cor_numVar <- cor_numVar[CorHigh, CorHigh]
corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt")
Accommodates has high correlation with price. It also becomes clear the multicollinearity is an issue. For example: the correlation accommodates,bedrooms, beds is very high, we might consider drop some of these variables.
train <- data
test <- ScoringData
Combine Training set, validation set, and test set.
train_labels <- train$id
test_labels <- test$id
#test$id <- NULL
#train$id <- NULL
#validation$id <- NULL
test$price <- NA
all <- rbind(train,test)
dim(all)
## [1] 51663 91
Create test set with NA price.
Using function Hmisc::describe(), we can examine our variables.
# Hmisc::describe(all)
I comment the code because the output contains too much information.
We can see from the output, some features contain too many missing values and some features are text. Due to limit time and effort, I will exclude them from our data.
colnames = c("name","summary","space","description","neighborhood_overview",
"notes","host_location",
"transit","access","interaction","house_rules","host_name",
"host_about","street",
"city","state","market","smart_location",
"country_code","country","license","jurisdiction_names")
all[,colnames] <- list(NULL)
How many numeric variables have missing value? Let’s take a closer look!
Index of variables with missing values.
NAcol <- which(colSums(is.na(all)) > 0)
NAcol
## host_listings_count host_total_listings_count beds
## 8 9 23
## square_feet price weekly_price
## 26 27 28
## monthly_price security_deposit cleaning_fee
## 29 30 31
## reviews_per_month
## 69
We can see from below, variables like “monthly_price”,“square_feet”,“weekly_price”,“security_deposit” all have large amount of missing value.
sort(colSums(sapply(all[NAcol], is.na)), decreasing = TRUE)
## monthly_price square_feet weekly_price
## 51418 51235 47635
## security_deposit price cleaning_fee
## 16722 10333 7474
## beds host_listings_count host_total_listings_count
## 153 10 10
## reviews_per_month
## 2
There are 10 numeric variables have missing values.
In this section, I am going to fix the 10 variables that contains missing values.
Besides making sure that the missing values are taken care off.I would convert categorical data into factors, but for variables like zipcode or neighbourhood_cleansed, I would use another approach to encode them.
We select all variables that should be logical from dataset.
cl <- c("host_is_superhost","host_has_profile_pic","host_identity_verified",
"is_location_exact","has_availability","requires_license","instant_bookable",
"is_business_travel_ready","require_guest_profile_picture","require_guest_phone_verification")
Inspect variables.
Hmisc::describe(all[,cl])
## all[, cl]
##
## 10 Variables 51663 Observations
## --------------------------------------------------------------------------------
## host_is_superhost
## n missing distinct
## 51653 10 2
##
## Value f t
## Frequency 40335 11318
## Proportion 0.781 0.219
## --------------------------------------------------------------------------------
## host_has_profile_pic
## n missing distinct
## 51653 10 2
##
## Value f t
## Frequency 130 51523
## Proportion 0.003 0.997
## --------------------------------------------------------------------------------
## host_identity_verified
## n missing distinct
## 51653 10 2
##
## Value f t
## Frequency 27837 23816
## Proportion 0.539 0.461
## --------------------------------------------------------------------------------
## is_location_exact
## n missing distinct
## 51663 0 2
##
## Value f t
## Frequency 8928 42735
## Proportion 0.173 0.827
## --------------------------------------------------------------------------------
## has_availability
## n missing distinct value
## 51663 0 1 t
##
## Value t
## Frequency 51663
## Proportion 1
## --------------------------------------------------------------------------------
## requires_license
## n missing distinct value
## 51663 0 1 f
##
## Value f
## Frequency 51663
## Proportion 1
## --------------------------------------------------------------------------------
## instant_bookable
## n missing distinct
## 51663 0 2
##
## Value f t
## Frequency 31160 20503
## Proportion 0.603 0.397
## --------------------------------------------------------------------------------
## is_business_travel_ready
## n missing distinct value
## 51663 0 1 f
##
## Value f
## Frequency 51663
## Proportion 1
## --------------------------------------------------------------------------------
## require_guest_profile_picture
## n missing distinct
## 51663 0 2
##
## Value f t
## Frequency 50470 1193
## Proportion 0.977 0.023
## --------------------------------------------------------------------------------
## require_guest_phone_verification
## n missing distinct
## 51663 0 2
##
## Value f t
## Frequency 50358 1305
## Proportion 0.975 0.025
## --------------------------------------------------------------------------------
From the table we can see that some variables(has_availability,requires_license,is_business_travel_ready) have only 1 level. These variables do not help fitting the model, we should drop them.
Variables, host_is_superhost and host_identity_verified have empty value. We can impute them with majority class.
all %>%
group_by(host_response_rate) %>%
count()
## # A tibble: 92 x 2
## # Groups: host_response_rate [92]
## host_response_rate n
## <chr> <int>
## 1 "" 10
## 2 "0%" 569
## 3 "10%" 17
## 4 "100%" 26023
## 5 "11%" 5
## 6 "13%" 5
## 7 "14%" 4
## 8 "17%" 10
## 9 "18%" 5
## 10 "20%" 45
## # … with 82 more rows
cl <- c("host_is_superhost","host_identity_verified","is_location_exact","instant_bookable")
all <- all %>%
select(!c("has_availability","requires_license","is_business_travel_ready","host_has_profile_pic")) %>%
mutate(host_is_superhost = if_else(host_is_superhost=="","f",host_is_superhost)) %>%
mutate(host_identity_verified = if_else(host_identity_verified=="","f",host_identity_verified)) %>%
mutate(host_is_superhost = if_else(host_is_superhost=="t","1","0")) %>% # t <- 1, f <- 0
mutate(host_identity_verified = if_else(host_identity_verified=="t","1","0")) %>%
mutate(is_location_exact = if_else(is_location_exact=="t","1","0")) %>%
mutate(instant_bookable= if_else(instant_bookable=="t","1","0")) %>%# t <- 1, f <- 0
mutate_at(cl,as.numeric) #Convert string to numeric
The host_response_rate should be numerical, but we have 14644 NA, We can just drop this column.
Hmisc::describe(all$host_response_rate)
## all$host_response_rate
## n missing distinct
## 51653 10 91
##
## lowest : 0% 10% 100% 11% 13% , highest: 96% 97% 98% 99% N/A
Hmisc::describe(all$host_response_time)
## all$host_response_time
## n missing distinct
## 51653 10 5
##
## lowest : a few days or more N/A within a day within a few hours within an hour
## highest: a few days or more N/A within a day within a few hours within an hour
##
## Value a few days or more N/A within a day
## Frequency 1209 14695 4770
## Proportion 0.023 0.284 0.092
##
## Value within a few hours within an hour
## Frequency 7028 23951
## Proportion 0.136 0.464
ggplot(train,aes(host_response_time,price))+
geom_boxplot()
all%>%
dplyr::group_by(host_response_time) %>%
dplyr::count() %>%
dplyr::arrange(desc(n))
## # A tibble: 6 x 2
## # Groups: host_response_time [6]
## host_response_time n
## <chr> <int>
## 1 "within an hour" 23951
## 2 "N/A" 14695
## 3 "within a few hours" 7028
## 4 "within a day" 4770
## 5 "a few days or more" 1209
## 6 "" 10
all <- all %>%
mutate(host_response_time = if_else(host_response_time=="","unknown",host_response_time)) %>%
mutate(host_response_time = if_else(host_response_time=="N/A","unknown",host_response_time))
We convert all empty strings and NA into one category called “unknown”.
Hmisc::describe(all$property_type)
## all$property_type
## n missing distinct
## 51663 0 35
##
## lowest : Aparthotel Apartment Barn Bed and breakfast Boat
## highest: Timeshare Tiny house Townhouse Treehouse Villa
No missing values. Very good!
unique(train$property_type)
## [1] "Condominium" "Apartment" "Guest suite"
## [4] "House" "Hotel" "Townhouse"
## [7] "Guesthouse" "Other" "Serviced apartment"
## [10] "Boutique hotel" "Loft" "Bed and breakfast"
## [13] "Villa" "Resort" "Bungalow"
## [16] "Hostel" "Cave" "Aparthotel"
## [19] "Tiny house" "Earth house" "Cabin"
## [22] "Dome house" "Camper/RV" "Cottage"
## [25] "Casa particular (Cuba)" "Boat" "Treehouse"
## [28] "Tent" "Barn" "Houseboat"
## [31] "Castle" "Island" "Nature lodge"
unique(test$property_type)
## [1] "Apartment" "Condominium" "Loft"
## [4] "Townhouse" "House" "Boutique hotel"
## [7] "Guest suite" "Tiny house" "Bed and breakfast"
## [10] "Guesthouse" "Serviced apartment" "Cottage"
## [13] "Villa" "Castle" "Hostel"
## [16] "Other" "Houseboat" "Hotel"
## [19] "Resort" "Timeshare" "Aparthotel"
## [22] "Boat" "Tent" "Cave"
## [25] "Camper/RV" "Bungalow" "Earth house"
## [28] "Lighthouse" "Dome house"
We can observe that there is some new levels in test set. Also, our property_type variable is high cardinality, which means it has many categories. Such variable is not suitable for one-hot encoding. One-hot encoding add too many new featuresto our model, causing curse of dimensionality. That is something we should prevent.
My approach is using mean-encoding to encode all levels. We do encoding on the train set, and apply it to the test set. If we encounter a new category. The Vtreat package handles everything for us. Also the Vtreat package use 5 folds cross validation to prevent overfitting.
First, we explore the average price for each property type. The red dot in each box represents its mean.
ggplot(data = train,
aes(x = property_type, y = price)) +
geom_boxplot() +
stat_summary(fun = mean, color = "darkred", position = position_dodge(0.75),
geom = "point", shape = 18, size = 3,
show.legend = FALSE) +
coord_flip()
However, when We looking into the distribution of number of unique value, we found it was highly unbalanced.
The distribution is so skewed, and it hits me that we can convert all rare categories into one category called “Other”. This is a very common approach for handling unbalanced categorical feature.
all%>%
count(property_type, sort=T)%>%
ggplot(aes(reorder(property_type, n),n))+
geom_col()+
coord_flip()+
theme_minimal()+
theme(axis.text.y = element_text(size=6))+
labs(x='Unique categories', y='Times appears in dataset')
all%>%
count(property_type, sort=T)%>%
mutate(total = sum(n),
perc = n/total,
cum_perc = cumsum(perc),
nums = 1:n_distinct(property_type))%>%
ggplot(aes(cum_perc, nums))+
geom_line()+
geom_vline(xintercept = .90, linetype='dashed')+
labs(x='Cumulative Percentage',y='Number of unique factor levels', title='How many unique levels capture 90% of all observed levels?')+
theme_minimal()
Wow, It seems that only 3 factors contributed to 90% of observed value.
One common approach for this issue is to collapse small categories to one category.
for (i in 1:nrow(all)) {
if (all$property_type[i] %in% c("Apartment","Townhouse","House","Condominium","Loft","Guest suite","Service apartment","Boutique hotel","Hotel") == FALSE){
all$property_type[i] = "Other"
}
}
Examine the distribution again
all%>%
count(property_type, sort=T)%>%
ggplot(aes(reorder(property_type, n),n))+
geom_col()+
coord_flip()+
theme_minimal()+
theme(axis.text.y = element_text(size=6))+
labs(x='Unique categories', y='Times appears in dataset')
ggplot(data = train,
aes(x = room_type, y = price)) +
geom_boxplot() +
stat_summary(fun = mean, color = "darkred", position = position_dodge(0.75),
geom = "point", shape = 18, size = 3,
show.legend = FALSE) +
coord_flip()
all%>%
count(room_type, sort=T)%>%
ggplot(aes(reorder(room_type, n),n))+
geom_col()+
coord_flip()+
theme_minimal()+
theme(axis.text.y = element_text(size=10))+
labs(x='Unique categories', y='Times appears in dataset')
ggplot(data = train,
aes(x = bed_type, y = price)) +
geom_boxplot() +
stat_summary(fun = mean, color = "darkred", position = position_dodge(0.75),
geom = "point", shape = 18, size = 3,
show.legend = FALSE) +
coord_flip()
all%>%
count(bed_type, sort=T)%>%
ggplot(aes(reorder(bed_type, n),n))+
geom_col()+
coord_flip()+
theme_minimal()+
theme(axis.text.y = element_text(size=10))+
labs(x='Unique categories', y='Times appears in dataset')
Convert property/room/bed to factor, I would use one-hot encoding on these features later.
fc <- c("property_type","bed_type","room_type")
all <- all %>%
mutate_at(fc,as.factor)
These four variables contains 1/3 missing values. We will exclude them from our dataset.
all <- all %>%
select(!c("monthly_price","square_feet","weekly_price","security_deposit"))
which(colSums(is.na(all)) > 0)
## host_listings_count host_total_listings_count beds
## 8 9 22
## price cleaning_fee reviews_per_month
## 25 26 61
all <- all %>%
mutate_at(c("cleaning_fee","beds","host_listings_count","host_total_listings_count",
"reviews_per_month"),
~replace(., is.na(.),0))
Now we can see in numeric variables there is only price has missing value.
which(colSums(is.na(all)) > 0)
## price
## 25
Review scores rating is on 1-100 scale, we can convert it to 1-10 scale.
all <- all %>%
mutate(review_scores_rating = review_scores_rating/10)
all <- all %>%
mutate_at("cancellation_policy",as.factor)
We have ZIP code of houses, and the overall quality of Zip code data looks good (some zip codes need to be cleaned).
I am thinking to use Zip code and its longitude and latitude, we could have geographic variables. To my knowledge, this two variables can be very helpful when we use tree-based model to train our data. Therefore, we need another dataset which contains Zip code info, longitude and latitude info. We can join data_drop table with ZIP code Info table to get the result.
Let’s first clean our Zip code variable.
Notice that our Zip codes should be 5 digits (houses are located in NY). Check how many digits our ZIP codes have.
unique(nchar(all$zipcode))
## [1] 5 0 8 6 10 1 4 11
“6” means there are some ZIP codes contain spaces, let’s remove them.
library(stringr)
all$zipcode <- str_remove_all(all$zipcode," ")
We have 4 digits Zip code because some houses has the wrong Zip code, we should remove them from our data. We search the zip code of East Village, the corret zipcode should be 10009.
all[nchar(all$zipcode)==4,c("zipcode","neighbourhood")]
## zipcode neighbourhood
## 38058 1009 East Village
all[nchar(all$zipcode)==4,c("zipcode")] <- "10009"
Some Zip codes contain character “NY”! That’s why we have 8 digits Zip code.
We just want the numbers, we don’t want “NY” to show up in our Zip code!
library(stringr)
all$zipcode = str_replace_all(all$zipcode,"NY","")
unique(nchar(all$zipcode))
## [1] 5 0 10 11
It Looks like we successfully remove “NY”.
We have 10 digits Zip code because Some ZIP codes includes the street number, we should exclude street character. For example:
all$zipcode<- sapply(all$zipcode, substring, 1, 5)
unique(nchar(all$zipcode))
## [1] 5 0
There are some empty strings in Zip code variables,How to impute them? I am going to use neighbourhood to impute zipcode
all[all$zipcode =="",c("zipcode","neighbourhood")] %>%
arrange(neighbourhood)
## zipcode neighbourhood
## 1 Alphabet City
## 2 Alphabet City
## 3 Alphabet City
## 4 Alphabet City
## 5 Alphabet City
## 6 Arrochar
## 7 Astoria
## 8 Astoria
## 9 Astoria
## 10 Astoria
## 11 Astoria
## 12 Astoria
## 13 Battery Park City
## 14 Bedford-Stuyvesant
## 15 Bedford-Stuyvesant
## 16 Bedford-Stuyvesant
## 17 Bedford-Stuyvesant
## 18 Bedford-Stuyvesant
## 19 Bedford-Stuyvesant
## 20 Bedford-Stuyvesant
## 21 Bedford-Stuyvesant
## 22 Bedford-Stuyvesant
## 23 Bedford-Stuyvesant
## 24 Bedford-Stuyvesant
## 25 Bedford-Stuyvesant
## 26 Bedford-Stuyvesant
## 27 Bedford-Stuyvesant
## 28 Bedford-Stuyvesant
## 29 Bedford-Stuyvesant
## 30 Bedford-Stuyvesant
## 31 Bedford-Stuyvesant
## 32 Bedford-Stuyvesant
## 33 Bedford-Stuyvesant
## 34 Bedford-Stuyvesant
## 35 Boerum Hill
## 36 Boerum Hill
## 37 Brooklyn
## 38 Brooklyn
## 39 Brooklyn
## 40 Brooklyn
## 41 Brooklyn
## 42 Brooklyn
## 43 Brooklyn
## 44 Brooklyn
## 45 Brooklyn
## 46 Brooklyn
## 47 Brooklyn
## 48 Brooklyn
## 49 Brooklyn
## 50 Brooklyn
## 51 Brooklyn
## 52 Brooklyn
## 53 Brooklyn
## 54 Brooklyn
## 55 Brooklyn
## 56 Brooklyn
## 57 Brooklyn
## 58 Brooklyn
## 59 Brooklyn
## 60 Brooklyn
## 61 Brooklyn
## 62 Brooklyn
## 63 Brooklyn
## 64 Brooklyn
## 65 Brooklyn
## 66 Brooklyn
## 67 Brooklyn
## 68 Brooklyn
## 69 Brooklyn
## 70 Brooklyn
## 71 Brooklyn
## 72 Brooklyn
## 73 Brooklyn
## 74 Brooklyn
## 75 Brooklyn
## 76 Brooklyn
## 77 Brooklyn
## 78 Brooklyn
## 79 Brooklyn
## 80 Brooklyn
## 81 Brooklyn
## 82 Brooklyn
## 83 Brooklyn
## 84 Brooklyn
## 85 Brooklyn
## 86 Brooklyn
## 87 Brooklyn
## 88 Brooklyn
## 89 Brooklyn
## 90 Brooklyn
## 91 Brooklyn
## 92 Brooklyn
## 93 Brooklyn
## 94 Brooklyn
## 95 Brooklyn
## 96 Brooklyn
## 97 Brooklyn
## 98 Brooklyn
## 99 Brooklyn
## 100 Brooklyn
## 101 Brooklyn
## 102 Brooklyn
## 103 Brooklyn
## 104 Brooklyn
## 105 Brooklyn
## 106 Brooklyn
## 107 Brooklyn
## 108 Brooklyn
## 109 Brooklyn
## 110 Brooklyn
## 111 Brooklyn
## 112 Brooklyn
## 113 Brooklyn
## 114 Brooklyn
## 115 Brooklyn
## 116 Brooklyn
## 117 Brooklyn
## 118 Brooklyn
## 119 Brooklyn
## 120 Brooklyn
## 121 Brooklyn
## 122 Brooklyn
## 123 Brooklyn
## 124 Brooklyn
## 125 Brooklyn
## 126 Brooklyn
## 127 Brooklyn
## 128 Brooklyn
## 129 Brooklyn
## 130 Bushwick
## 131 Bushwick
## 132 Bushwick
## 133 Bushwick
## 134 Bushwick
## 135 Bushwick
## 136 Bushwick
## 137 Bushwick
## 138 Bushwick
## 139 Bushwick
## 140 Chelsea
## 141 Chelsea
## 142 Chelsea
## 143 Chelsea
## 144 Chelsea
## 145 Chelsea
## 146 Chelsea
## 147 Chelsea
## 148 Chelsea
## 149 Chelsea
## 150 Chinatown
## 151 Chinatown
## 152 Chinatown
## 153 Clinton Hill
## 154 Clinton Hill
## 155 Clinton Hill
## 156 Clinton Hill
## 157 Concourse
## 158 Crotona
## 159 Crown Heights
## 160 Crown Heights
## 161 Crown Heights
## 162 Crown Heights
## 163 Downtown Brooklyn
## 164 Downtown Brooklyn
## 165 East Elmhurst
## 166 East Flatbush
## 167 East Harlem
## 168 East Harlem
## 169 East Harlem
## 170 East Harlem
## 171 East New York
## 172 East New York
## 173 East New York
## 174 East New York
## 175 East Village
## 176 East Village
## 177 East Village
## 178 East Village
## 179 East Village
## 180 East Village
## 181 East Village
## 182 East Village
## 183 East Village
## 184 East Village
## 185 Elmhurst
## 186 Financial District
## 187 Financial District
## 188 Financial District
## 189 Financial District
## 190 Financial District
## 191 Flatbush
## 192 Flatbush
## 193 Flatbush
## 194 Flatiron District
## 195 Flatiron District
## 196 Flushing
## 197 Flushing
## 198 Fort Greene
## 199 Gramercy Park
## 200 Gravesend
## 201 Greenpoint
## 202 Greenpoint
## 203 Greenpoint
## 204 Grymes Hill
## 205 Hamilton Heights
## 206 Hamilton Heights
## 207 Hamilton Heights
## 208 Hamilton Heights
## 209 Hamilton Heights
## 210 Harlem
## 211 Harlem
## 212 Harlem
## 213 Harlem
## 214 Harlem
## 215 Harlem
## 216 Harlem
## 217 Harlem
## 218 Harlem
## 219 Harlem
## 220 Harlem
## 221 Harlem
## 222 Harlem
## 223 Harlem
## 224 Harlem
## 225 Hell's Kitchen
## 226 Hell's Kitchen
## 227 Hell's Kitchen
## 228 Hell's Kitchen
## 229 Hell's Kitchen
## 230 Howard Beach
## 231 Inwood
## 232 Kew Garden Hills
## 233 Kips Bay
## 234 Kips Bay
## 235 Lefferts Garden
## 236 Lefferts Garden
## 237 Lefferts Garden
## 238 Long Island City
## 239 Lower East Side
## 240 Lower East Side
## 241 Lower East Side
## 242 Lower East Side
## 243 Lower East Side
## 244 Lower East Side
## 245 Lower East Side
## 246 Lower East Side
## 247 Manhattan
## 248 Manhattan
## 249 Manhattan
## 250 Manhattan
## 251 Manhattan
## 252 Manhattan
## 253 Manhattan
## 254 Manhattan
## 255 Manhattan
## 256 Manhattan
## 257 Manhattan
## 258 Manhattan
## 259 Manhattan
## 260 Manhattan
## 261 Manhattan
## 262 Manhattan
## 263 Manhattan
## 264 Manhattan
## 265 Manhattan
## 266 Manhattan
## 267 Manhattan
## 268 Manhattan
## 269 Manhattan
## 270 Manhattan
## 271 Manhattan
## 272 Manhattan
## 273 Manhattan
## 274 Manhattan
## 275 Manhattan
## 276 Manhattan
## 277 Manhattan
## 278 Manhattan
## 279 Manhattan
## 280 Manhattan
## 281 Manhattan
## 282 Manhattan
## 283 Manhattan
## 284 Manhattan
## 285 Manhattan
## 286 Manhattan
## 287 Manhattan
## 288 Manhattan
## 289 Manhattan
## 290 Manhattan
## 291 Manhattan
## 292 Manhattan
## 293 Manhattan
## 294 Manhattan
## 295 Manhattan
## 296 Manhattan
## 297 Manhattan
## 298 Manhattan
## 299 Manhattan
## 300 Manhattan
## 301 Manhattan
## 302 Manhattan
## 303 Manhattan
## 304 Manhattan
## 305 Manhattan
## 306 Manhattan
## 307 Manhattan
## 308 Manhattan
## 309 Manhattan
## 310 Manhattan
## 311 Manhattan
## 312 Manhattan
## 313 Manhattan
## 314 Manhattan
## 315 Manhattan
## 316 Manhattan
## 317 Manhattan
## 318 Manhattan
## 319 Manhattan
## 320 Manhattan
## 321 Manhattan
## 322 Manhattan
## 323 Manhattan
## 324 Manhattan
## 325 Manhattan
## 326 Manhattan
## 327 Manhattan
## 328 Manhattan
## 329 Manhattan
## 330 Manhattan
## 331 Manhattan
## 332 Manhattan
## 333 Manhattan
## 334 Manhattan
## 335 Manhattan
## 336 Manhattan
## 337 Manhattan
## 338 Manhattan
## 339 Manhattan
## 340 Manhattan
## 341 Manhattan
## 342 Manhattan
## 343 Manhattan
## 344 Manhattan
## 345 Manhattan
## 346 Manhattan
## 347 Maspeth
## 348 Midtown
## 349 Midtown
## 350 Midtown
## 351 Midtown
## 352 Midtown
## 353 Midtown
## 354 Midtown
## 355 Midtown
## 356 Midtown
## 357 Midtown
## 358 Midtown
## 359 Midtown
## 360 Midtown
## 361 Midtown East
## 362 Midtown East
## 363 Midtown East
## 364 Midwood
## 365 Morningside Heights
## 366 Mott Haven
## 367 Park Slope
## 368 Prospect Heights
## 369 Queens
## 370 Queens
## 371 Queens
## 372 Queens
## 373 Queens
## 374 Queens
## 375 Queens
## 376 Queens
## 377 Queens
## 378 Queens
## 379 Queens
## 380 Queens
## 381 Queens
## 382 Queens
## 383 Queens
## 384 Queens
## 385 Queens
## 386 Queens
## 387 Queens
## 388 Queens
## 389 Queens
## 390 Queens
## 391 Rego Park
## 392 Richmond Hill
## 393 Ridgewood
## 394 South Ozone Park
## 395 Sunnyside
## 396 Sunnyside
## 397 Sunnyside
## 398 Sunnyside
## 399 Sunset Park
## 400 Sunset Park
## 401 The Bronx
## 402 The Bronx
## 403 The Bronx
## 404 The Bronx
## 405 The Bronx
## 406 Upper East Side
## 407 Upper East Side
## 408 Upper East Side
## 409 Upper East Side
## 410 Upper East Side
## 411 Upper East Side
## 412 Upper East Side
## 413 Upper East Side
## 414 Upper East Side
## 415 Upper East Side
## 416 Upper East Side
## 417 Upper West Side
## 418 Upper West Side
## 419 Upper West Side
## 420 Washington Heights
## 421 Washington Heights
## 422 Washington Heights
## 423 West Village
## 424 West Village
## 425 West Village
## 426 Westerleigh
## 427 Whitestone
## 428 Williamsburg
## 429 Williamsburg
## 430 Williamsburg
## 431 Williamsburg
## 432 Williamsburg
## 433 Williamsburg
## 434 Williamsburg
## 435 Williamsburg
## 436 Williamsburg
## 437 Williamsburg
## 438 Williamsburg
## 439 Williamsburg
## 440 Williamsburg
## 441 Williamsburg
## 442 Williamsburg
## 443 Williamsburg
all[all$neighbourhood =="",c("zipcode","neighbourhood")]
## zipcode neighbourhood
## 1956 11001
## 2537 11363
## 2823 10031
## 3821 11363
## 3826 10002
## 4406 11237
## 4441 11001
## 4711 11211
## 4862 10453
## 5376 11222
## 7087 11213
## 7273 11362
## 8182 11219
## 8831 11211
## 9001 11231
## 9195 10705
## 10107 11211
## 10268 10002
## 11362 10016
## 11639 10019
## 11648 11355
## 12353 11385
## 12357 11385
## 12637 10469
## 12744 11362
## 13354 11211
## 13748 11105
## 14334 10455
## 14662 10005
## 16204 11004
## 16790 10467
## 20036 11369
## 21393 11362
## 21593 11362
## 22141 10031
## 22930 11363
## 22959 10016
## 23438 11369
## 23897 11362
## 24010 10010
## 24154 11211
## 26113 10028
## 26291 11221
## 26926 10030
## 27259 11234
## 28633 11362
## 30910 11211
## 32446 11221
## 32849 10019
## 33061 10033
## 33081 11237
## 33576 10036
## 33974 10032
## 36113 11211
## 36520 11372
## 36556 10035
## 37112 10038
## 37128 11370
## 37141 11237
## 37584 10456
## 39027 10009
## 51626 11102
## 51627 11206
## 51628 10030
## 51629 10033
## 51630 11211
## 51631 11237
## 51632 11355
## 51633 11211
## 51634 10309
## 51635 11225
## 51636 10002
## 51637 11101
## 51638 10038
## 51639 11216
## 51640 11234
## 51641 11692
## 51642 10013
all[all$zipcode ==""& all$neighbourhood=="",]
## [1] id
## [2] host_since
## [3] host_response_time
## [4] host_response_rate
## [5] host_acceptance_rate
## [6] host_is_superhost
## [7] host_neighbourhood
## [8] host_listings_count
## [9] host_total_listings_count
## [10] host_verifications
## [11] host_identity_verified
## [12] neighbourhood
## [13] neighbourhood_cleansed
## [14] neighbourhood_group_cleansed
## [15] zipcode
## [16] is_location_exact
## [17] property_type
## [18] room_type
## [19] accommodates
## [20] bathrooms
## [21] bedrooms
## [22] beds
## [23] bed_type
## [24] amenities
## [25] price
## [26] cleaning_fee
## [27] guests_included
## [28] extra_people
## [29] minimum_nights
## [30] maximum_nights
## [31] minimum_minimum_nights
## [32] maximum_minimum_nights
## [33] minimum_maximum_nights
## [34] maximum_maximum_nights
## [35] minimum_nights_avg_ntm
## [36] maximum_nights_avg_ntm
## [37] calendar_updated
## [38] availability_30
## [39] availability_60
## [40] availability_90
## [41] availability_365
## [42] number_of_reviews
## [43] number_of_reviews_ltm
## [44] first_review
## [45] last_review
## [46] review_scores_rating
## [47] review_scores_accuracy
## [48] review_scores_cleanliness
## [49] review_scores_checkin
## [50] review_scores_communication
## [51] review_scores_location
## [52] review_scores_value
## [53] instant_bookable
## [54] cancellation_policy
## [55] require_guest_profile_picture
## [56] require_guest_phone_verification
## [57] calculated_host_listings_count
## [58] calculated_host_listings_count_entire_homes
## [59] calculated_host_listings_count_private_rooms
## [60] calculated_host_listings_count_shared_rooms
## [61] reviews_per_month
## <0 rows> (or 0-length row.names)
dim(all)
## [1] 51663 61
So we can see that some missing zip codes has neighbourhood information, so I will try to impute missing zipcode with the highest frequency zipcode in its neighbourhood. (I have no idea if it is the correct way to do it)
zipcode_max <- all %>%
count(neighbourhood_cleansed,zipcode) %>%
group_by(neighbourhood_cleansed) %>%
filter(n == max(n))
as.data.frame(zipcode_max)
## neighbourhood_cleansed zipcode n
## 1 Allerton 10469 28
## 2 Arden Heights 10312 4
## 3 Arrochar 10305 24
## 4 Arverne 11692 94
## 5 Astoria 11103 398
## 6 Bath Beach 11214 23
## 7 Battery Park City 10280 29
## 8 Bay Ridge 11209 136
## 9 Bay Terrace 11360 5
## 10 Bay Terrace, Staten Island 10308 2
## 11 Baychester 10469 7
## 12 Bayside 11361 29
## 13 Bayswater 11691 19
## 14 Bedford-Stuyvesant 11221 1164
## 15 Belle Harbor 11694 6
## 16 Bellerose 11426 6
## 17 Belmont 10458 36
## 18 Bensonhurst 11214 32
## 19 Bergen Beach 11234 12
## 20 Boerum Hill 11217 115
## 21 Borough Park 11219 95
## 22 Breezy Point 11694 2
## 23 Briarwood 11435 43
## 24 Brighton Beach 11235 81
## 25 Bronxdale 10462 21
## 26 Brooklyn Heights 11201 149
## 27 Brownsville 11212 54
## 28 Bull's Head 10314 4
## 29 Bushwick 11237 957
## 30 Cambria Heights 11411 23
## 31 Canarsie 11236 208
## 32 Carroll Gardens 11231 214
## 33 Castle Hill 10473 4
## 34 Castleton Corners 10314 2
## 35 Chelsea 10011 683
## 36 Chinatown 10002 333
## 37 City Island 10464 18
## 38 Civic Center 10038 24
## 39 Claremont Village 10456 15
## 40 Clason Point 10473 23
## 41 Clifton 10304 21
## 42 Clinton Hill 11238 355
## 43 Co-op City 10475 4
## 44 Cobble Hill 11201 63
## 45 College Point 11356 25
## 46 Columbia St 11231 37
## 47 Concord 10305 14
## 48 Concourse 10451 28
## 49 Concourse Village 10456 18
## 50 Coney Island 11224 20
## 51 Corona 11368 89
## 52 Country Club 10465 2
## 53 Crown Heights 11213 548
## 54 Cypress Hills 11208 101
## 55 Ditmars Steinway 11105 263
## 56 Dongan Hills 10306 4
## 57 Douglaston 11362 7
## 58 Downtown Brooklyn 11201 70
## 59 DUMBO 11201 30
## 60 Dyker Heights 11228 13
## 61 East Elmhurst 11369 177
## 62 East Flatbush 11203 277
## 63 East Harlem 10029 796
## 64 East Morrisania 10459 9
## 65 East New York 11207 181
## 66 East Village 10009 1137
## 67 Eastchester 10475 12
## 68 Edenwald 10466 16
## 69 Edgemere 11691 9
## 70 Elmhurst 11373 288
## 71 Eltingville 10308 1
## 72 Eltingville 10312 1
## 73 Emerson Hill 10314 4
## 74 Far Rockaway 11691 31
## 75 Fieldston 10471 8
## 76 Financial District 10005 359
## 77 Flatbush 11226 485
## 78 Flatiron District 10010 65
## 79 Flatlands 11234 67
## 80 Flushing 11355 304
## 81 Fordham 10458 25
## 82 Fordham 10468 25
## 83 Forest Hills 11375 140
## 84 Fort Greene 11205 207
## 85 Fort Hamilton 11209 57
## 86 Fresh Meadows 11365 29
## 87 Glen Oaks 11004 1
## 88 Glendale 11385 73
## 89 Gowanus 11215 125
## 90 Gramercy 10003 232
## 91 Graniteville 10303 4
## 92 Grant City 10306 8
## 93 Gravesend 11223 71
## 94 Great Kills 10308 12
## 95 Greenpoint 11222 1023
## 96 Greenwich Village 10012 226
## 97 Grymes Hill 10304 6
## 98 Harlem 10031 844
## 99 Hell's Kitchen 10019 1015
## 100 Highbridge 10452 32
## 101 Hollis 11423 15
## 102 Holliswood 11423 2
## 103 Howard Beach 11414 24
## 104 Howland Hook 10303 2
## 105 Huguenot 10312 3
## 106 Hunts Point 10474 24
## 107 Inwood 10034 168
## 108 Jackson Heights 11372 168
## 109 Jamaica 11434 170
## 110 Jamaica Estates 11432 17
## 111 Jamaica Hills 11432 14
## 112 Kensington 11218 166
## 113 Kew Gardens 11415 31
## 114 Kew Gardens Hills 11367 23
## 115 Kingsbridge 10463 43
## 116 Kips Bay 10016 299
## 117 Laurelton 11413 35
## 118 Lighthouse Hill 10306 2
## 119 Little Italy 10013 152
## 120 Little Neck 11363 2
## 121 Long Island City 11101 396
## 122 Longwood 10456 22
## 123 Longwood 10459 22
## 124 Lower East Side 10002 944
## 125 Manhattan Beach 11235 10
## 126 Marble Hill 10463 11
## 127 Mariners Harbor 10303 11
## 128 Maspeth 11378 77
## 129 Melrose 10451 10
## 130 Middle Village 11379 39
## 131 Midland Beach 10306 8
## 132 Midtown 10022 411
## 133 Midwood 11230 78
## 134 Mill Basin 11234 11
## 135 Morningside Heights 10025 137
## 136 Morris Heights 10453 19
## 137 Morris Park 10461 12
## 138 Morrisania 10456 7
## 139 Mott Haven 10454 43
## 140 Mount Eden 10452 6
## 141 Mount Hope 10457 20
## 142 Murray Hill 10016 400
## 143 Navy Yard 11205 9
## 144 Neponsit 11694 3
## 145 New Brighton 10301 5
## 146 New Dorp Beach 10306 4
## 147 New Springville 10314 7
## 148 NoHo 10012 42
## 149 Nolita 10012 240
## 150 North Riverdale 10705 9
## 151 Norwood 10467 26
## 152 Oakwood 10306 7
## 153 Olinville 10467 5
## 154 Ozone Park 11417 44
## 155 Park Slope 11215 296
## 156 Parkchester 10462 24
## 157 Pelham Bay 10461 15
## 158 Pelham Gardens 10469 31
## 159 Port Morris 10454 26
## 160 Port Richmond 10302 10
## 161 Prince's Bay 10309 1
## 162 Prince's Bay 10312 1
## 163 Prospect Heights 11238 302
## 164 Prospect-Lefferts Gardens 11225 396
## 165 Queens Village 11429 44
## 166 Randall Manor 10310 18
## 167 Red Hook 11231 90
## 168 Rego Park 11374 108
## 169 Richmond Hill 11418 53
## 170 Richmondtown 10306 2
## 171 Ridgewood 11385 413
## 172 Riverdale 10463 7
## 173 Rockaway Beach 11693 71
## 174 Roosevelt Island 10044 82
## 175 Rosebank 10305 10
## 176 Rosedale 11422 83
## 177 Rossville 10309 3
## 178 Schuylerville 10461 8
## 179 Sea Gate 11224 7
## 180 Sheepshead Bay 11229 91
## 181 Shore Acres 10305 8
## 182 Silver Lake 10301 2
## 183 SoHo 10012 186
## 184 Soundview 10472 11
## 185 South Beach 10305 8
## 186 South Ozone Park 11420 26
## 187 South Slope 11215 287
## 188 Springfield Gardens 11413 59
## 189 Spuyten Duyvil 10463 5
## 190 St. Albans 11412 53
## 191 St. George 10301 53
## 192 Stapleton 10301 21
## 193 Stuyvesant Town 10009 34
## 194 Sunnyside 11104 242
## 195 Sunset Park 11220 188
## 196 Theater District 10036 125
## 197 Throgs Neck 10465 36
## 198 Todt Hill 10314 2
## 199 Tompkinsville 10301 37
## 200 Tottenville 10307 4
## 201 Tremont 10457 13
## 202 Tribeca 10013 97
## 203 Two Bridges 10002 68
## 204 Unionport 10472 6
## 205 University Heights 10453 17
## 206 Upper East Side 10128 416
## 207 Upper West Side 10025 845
## 208 Van Nest 10462 10
## 209 Vinegar Hill 11201 31
## 210 Wakefield 10466 60
## 211 Washington Heights 10032 520
## 212 West Brighton 10310 23
## 213 West Farms 10460 3
## 214 West Village 10014 688
## 215 Westchester Square 10461 9
## 216 Westerleigh 10314 7
## 217 Whitestone 11357 17
## 218 Williamsbridge 10469 26
## 219 Williamsburg 11211 2223
## 220 Willowbrook 10314 1
## 221 Windsor Terrace 11218 91
## 222 Woodhaven 11421 105
## 223 Woodlawn 10470 8
## 224 Woodside 11377 259
impute all the missing zipcode
for (i in 1:nrow(all)){
if(all$zipcode[i]==""){
all$zipcode[i]= pull(zipcode_max[zipcode_max$neighbourhood_cleansed == all$neighbourhood_cleansed[i],"zipcode"])
}
}
dim(all)
## [1] 51663 61
sum(all$zipcode == "")
## [1] 0
Hmisc::describe(all$zipcode)
## all$zipcode
## n missing distinct
## 51663 0 191
##
## lowest : 10001 10002 10003 10004 10005, highest: 11691 11692 11693 11694 13355
dim(all)
## [1] 51663 61
all%>%
count(zipcode, sort=T)%>%
ggplot(aes(reorder(zipcode, n),n))+
geom_col()+
coord_flip()+
theme_minimal()+
theme(axis.text.y = element_text(size=2))+
labs(x='Unique categories', y='Times appears in dataset')
all %>%
group_by(zipcode) %>%
count() %>%
arrange(desc(n))
## # A tibble: 191 x 2
## # Groups: zipcode [191]
## zipcode n
## <chr> <int>
## 1 11211 2302
## 2 11221 2104
## 3 11206 1529
## 4 11216 1439
## 5 10002 1411
## 6 10019 1350
## 7 10009 1204
## 8 11238 1150
## 9 11233 1126
## 10 11222 1106
## # … with 181 more rows
Mean-encode our zipcode
dTrainN <- train
# NOT RUN {
# numeric example
set.seed(23525)
# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsN
# and dTrainNTreated
unpack[
treatmentsN = treatments,
dTrainNTreated = crossFrame
] <- mkCrossFrameNExperiment(
dframe = dTrainN,
varlist = setdiff("zipcode", 'price'),
outcomename = 'price',
verbose = FALSE)
# the treatments include a score frame relating new
# derived variables to original columns
treatmentsN$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
print(.)
## origName varName code rsq sig
## 1 zipcode zipcode_catP catP 1.855895e-03 1.919999e-18
## 2 zipcode zipcode_catN catN 1.902765e-01 0.000000e+00
## 3 zipcode zipcode_catD catD 1.514050e-01 0.000000e+00
## 4 zipcode zipcode_lev_x_10002 lev 1.463277e-03 7.291330e-15
## 5 zipcode zipcode_lev_x_10009 lev 1.959967e-03 2.167616e-19
## 6 zipcode zipcode_lev_x_10019 lev 7.593893e-03 1.748099e-70
## 7 zipcode zipcode_lev_x_11206 lev 3.984955e-03 9.068693e-38
## 8 zipcode zipcode_lev_x_11211 lev 8.112251e-04 6.983097e-09
## 9 zipcode zipcode_lev_x_11216 lev 1.552024e-03 1.129175e-15
## 10 zipcode zipcode_lev_x_11221 lev 5.878451e-03 6.266125e-55
## 11 zipcode zipcode_lev_x_11222 lev 5.018784e-07 8.854857e-01
## 12 zipcode zipcode_lev_x_11233 lev 2.541367e-03 1.125461e-24
## 13 zipcode zipcode_lev_x_11237 lev 4.820195e-03 2.442631e-45
## 14 zipcode zipcode_lev_x_11238 lev 1.391928e-05 4.481790e-01
## extraModelDegrees
## 1 208
## 2 208
## 3 208
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 0
## 11 0
## 12 0
## 13 0
## 14 0
# the treated frame is a "cross frame" which
# is a transform of the training data built
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
dTrainNTreated %.>%
head(.) %.>%
print(.)
## zipcode_catP zipcode_catN zipcode_catD zipcode_lev_x_10002
## 1 0.012811207 -59.044505 47.52400 0
## 2 0.021957682 -2.906385 104.05159 0
## 3 0.021013283 -1.465574 103.63760 0
## 4 0.007367619 91.877609 189.38397 0
## 5 0.004573005 -54.118285 42.04390 0
## 6 0.007295031 -40.282820 57.64091 0
## zipcode_lev_x_10009 zipcode_lev_x_10019 zipcode_lev_x_11206
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## zipcode_lev_x_11211 zipcode_lev_x_11216 zipcode_lev_x_11221
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## zipcode_lev_x_11222 zipcode_lev_x_11233 zipcode_lev_x_11237
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## zipcode_lev_x_11238 price
## 1 0 145
## 2 1 199
## 3 1 165
## 4 0 59
## 5 0 48
## 6 0 80
# Any future application data is prepared with
# the prepare method.
dTestNTreated <- prepare(treatmentsN, test, pruneSig=NULL)
dTestNTreated %.>%
head(.) %.>%
print(.)
## zipcode_catP zipcode_catN zipcode_catD zipcode_lev_x_10002
## 1 0.01541253 -13.76147 89.82499 0
## 2 0.01613840 -39.19542 76.89985 0
## 3 0.02298572 32.05681 109.19339 0
## 4 0.01696105 13.44983 94.44524 0
## 5 0.01592064 35.58320 114.96631 0
## 6 0.01613840 -39.19542 76.89985 0
## zipcode_lev_x_10009 zipcode_lev_x_10019 zipcode_lev_x_11206
## 1 0 0 0
## 2 0 0 0
## 3 1 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## zipcode_lev_x_11211 zipcode_lev_x_11216 zipcode_lev_x_11221
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## zipcode_lev_x_11222 zipcode_lev_x_11233 zipcode_lev_x_11237
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## zipcode_lev_x_11238 price
## 1 0 NA
## 2 0 NA
## 3 0 NA
## 4 0 NA
## 5 0 NA
## 6 0 NA
# }
all[1:41330,"zipcode_m"] <- dTrainNTreated$zipcode_catN
all[41331:51663,"zipcode_m"]<- dTestNTreated$zipcode_catN
all[,c("zipcode_m")] <- sapply(all[,c("zipcode_m")],as.numeric)
head(all$zipcode_m)
## [1] -59.044505 -2.906385 -1.465574 91.877609 -54.118285 -40.282820
ggplot(all,aes(zipcode_m))+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(all,aes(zipcode_m,price))+
geom_point()
## Warning: Removed 10333 rows containing missing values (geom_point).
We can see that our neighbourhood variable has so many levels. First thought is using one-hot encoding, but this measure could lead to very large dimension in our dataset, causing multicollinearity and curse of dimension. Also, I would use impact encoding of Vtreat package instead of collapsing rare categories. I experiment two approaches and the impact encoding gave me a lower RMSE on local test cv score.
Hmisc::describe(all$host_neighbourhood)
## all$host_neighbourhood
## n missing distinct
## 44532 7131 439
##
## lowest : 7th Arrondissement Abbotsford Adams Point Alésia Algirós
## highest: X Arrondissement XI Arrondissement Yehuda Hamaccabi Yorkville Zona 8
all%>%
count(host_neighbourhood, sort=T)%>%
ggplot(aes(reorder(host_neighbourhood, n),n))+
geom_col()+
coord_flip()+
theme_minimal()+
theme(axis.text.y = element_text(size=2))+
labs(x='Unique categories', y='Times appears in dataset')
Hmisc::describe(all$neighbourhood)
## all$neighbourhood
## n missing distinct
## 51585 78 202
##
## lowest : Allerton Alphabet City Annadale Arden Heights Arrochar
## highest: Willowbrook Windsor Terrace Woodhaven Woodlawn Woodside
Hmisc::describe(all$neighbourhood_cleansed)
## all$neighbourhood_cleansed
## n missing distinct
## 51663 0 220
##
## lowest : Allerton Arden Heights Arrochar Arverne Astoria
## highest: Willowbrook Windsor Terrace Woodhaven Woodlawn Woodside
Distribution of neighbourhood_cleansed
all%>%
count(neighbourhood_cleansed, sort=T)%>%
ggplot(aes(reorder(neighbourhood_cleansed, n),n))+
geom_col()+
coord_flip()+
theme_minimal()+
theme(axis.text.y = element_text(size=2))+
labs(x='Unique categories', y='Times appears in dataset')
all%>%
count(neighbourhood_cleansed, sort=T)%>%
mutate(total = sum(n),
perc = n/total,
cum_perc = cumsum(perc),
nums = 1:n_distinct(neighbourhood_cleansed))%>%
ggplot(aes(cum_perc, nums))+
geom_line()+
geom_vline(xintercept = .80, linetype='dashed')+
labs(x='Cumulative Percentage',y='Number of unique factor levels', title='How many unique levels capture 80% of all observed levels?')+
theme_minimal()
Hmisc::describe(all$neighbourhood_group_cleansed)
## all$neighbourhood_group_cleansed
## n missing distinct
## 51663 0 5
##
## lowest : Bronx Brooklyn Manhattan Queens Staten Island
## highest: Bronx Brooklyn Manhattan Queens Staten Island
##
## Value Bronx Brooklyn Manhattan Queens
## Frequency 1289 21236 22138 6576
## Proportion 0.025 0.411 0.429 0.127
##
## Value Staten Island
## Frequency 424
## Proportion 0.008
all%>%
count(neighbourhood_group_cleansed, sort=T)%>%
ggplot(aes(reorder(neighbourhood_group_cleansed, n),n))+
geom_col()+
coord_flip()+
theme_minimal()+
theme(axis.text.y = element_text(size=8))+
labs(x='Unique categories', y='Times appears in dataset')
####Convert neighbourhood_group_cleansed to factor
all<- all %>%
mutate_at(c("neighbourhood_group_cleansed"),as.factor)
Use vtreat impact encoding host_neighbourhood and neighbourhood_cleansed
library(vtreat)
dTrainN <- all[1:41330,]
# NOT RUN {
# numeric example
set.seed(23525)
# we perform a vtreat cross frame experiment
# and unpack the results into treatmentsN
# and dTrainNTreated
unpack[
treatmentsN = treatments,
dTrainNTreated = crossFrame
] <- mkCrossFrameNExperiment(
dframe = dTrainN,
varlist = setdiff(c("host_neighbourhood","neighbourhood_cleansed"), 'price'),
outcomename = 'price',
verbose = FALSE)
# the treatments include a score frame relating new
# derived variables to original columns
treatmentsN$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')] %.>%
print(.)
## origName varName
## 1 host_neighbourhood host_neighbourhood_catP
## 2 host_neighbourhood host_neighbourhood_catN
## 3 host_neighbourhood host_neighbourhood_catD
## 4 neighbourhood_cleansed neighbourhood_cleansed_catP
## 5 neighbourhood_cleansed neighbourhood_cleansed_catN
## 6 neighbourhood_cleansed neighbourhood_cleansed_catD
## 7 host_neighbourhood host_neighbourhood_lev_x_
## 8 host_neighbourhood host_neighbourhood_lev_x_Astoria
## 9 host_neighbourhood host_neighbourhood_lev_x_Bedford_minus_Stuyvesant
## 10 host_neighbourhood host_neighbourhood_lev_x_Bushwick
## 11 host_neighbourhood host_neighbourhood_lev_x_Crown_Heights
## 12 host_neighbourhood host_neighbourhood_lev_x_Harlem
## 13 host_neighbourhood host_neighbourhood_lev_x_Hell_tick_s_Kitchen
## 14 host_neighbourhood host_neighbourhood_lev_x_Upper_East_Side
## 15 host_neighbourhood host_neighbourhood_lev_x_Upper_West_Side
## 16 host_neighbourhood host_neighbourhood_lev_x_Williamsburg
## 17 neighbourhood_cleansed neighbourhood_cleansed_lev_x_Bedford_minus_Stuyvesant
## 18 neighbourhood_cleansed neighbourhood_cleansed_lev_x_Bushwick
## 19 neighbourhood_cleansed neighbourhood_cleansed_lev_x_Chelsea
## 20 neighbourhood_cleansed neighbourhood_cleansed_lev_x_Crown_Heights
## 21 neighbourhood_cleansed neighbourhood_cleansed_lev_x_East_Harlem
## 22 neighbourhood_cleansed neighbourhood_cleansed_lev_x_East_Village
## 23 neighbourhood_cleansed neighbourhood_cleansed_lev_x_Greenpoint
## 24 neighbourhood_cleansed neighbourhood_cleansed_lev_x_Harlem
## 25 neighbourhood_cleansed neighbourhood_cleansed_lev_x_Hell_tick_s_Kitchen
## 26 neighbourhood_cleansed neighbourhood_cleansed_lev_x_Midtown
## 27 neighbourhood_cleansed neighbourhood_cleansed_lev_x_Upper_East_Side
## 28 neighbourhood_cleansed neighbourhood_cleansed_lev_x_Upper_West_Side
## 29 neighbourhood_cleansed neighbourhood_cleansed_lev_x_Williamsburg
## code rsq sig extraModelDegrees
## 1 catP 1.559984e-04 1.111069e-02 399
## 2 catN 1.523291e-01 0.000000e+00 399
## 3 catD 7.150960e-02 0.000000e+00 399
## 4 catP 1.262647e-04 2.234741e-02 217
## 5 catN 1.852532e-01 0.000000e+00 217
## 6 catD 1.471896e-01 0.000000e+00 217
## 7 lev 2.764632e-05 2.851102e-01 0
## 8 lev 2.356971e-03 5.324265e-23 0
## 9 lev 6.757545e-03 6.723372e-63 0
## 10 lev 8.964951e-03 6.306274e-83 0
## 11 lev 1.887332e-03 9.933154e-19 0
## 12 lev 6.989494e-04 7.637321e-08 0
## 13 lev 9.768384e-03 3.182998e-90 0
## 14 lev 1.597463e-03 4.347630e-16 0
## 15 lev 2.823125e-03 3.115498e-27 0
## 16 lev 3.020449e-04 4.103105e-04 0
## 17 lev 8.454522e-03 2.712710e-78 0
## 18 lev 1.111700e-02 1.755041e-102 0
## 19 lev 1.138014e-02 7.094750e-105 0
## 20 lev 2.117851e-03 7.938611e-21 0
## 21 lev 5.898481e-04 7.891580e-07 0
## 22 lev 4.712387e-03 2.316619e-44 0
## 23 lev 8.837375e-11 9.984752e-01 0
## 24 lev 2.059180e-03 2.712338e-20 0
## 25 lev 1.372127e-02 3.415772e-126 0
## 26 lev 2.696492e-02 1.162390e-247 0
## 27 lev 1.944887e-03 2.973161e-19 0
## 28 lev 4.128700e-03 4.515167e-39 0
## 29 lev 5.382014e-04 2.395449e-06 0
# the treated frame is a "cross frame" which
# is a transform of the training data built
# as if the treatment were learned on a different
# disjoint training set to avoid nested model
# bias and over-fit.
varnames <- treatmentsN$scoreFrame[treatmentsN$scoreFrame$code%in% c('catN'),'varName']
all1 <- prepare(treatmentsN, all, pruneSig=NULL,varRestriction = varnames)
all$host_neighbourhood <- all1$host_neighbourhood_catN
all$neighbourhood_cleansed <- all1$neighbourhood_cleansed_catN
all[,c("host_neighbourhood")] <- sapply(all[,c("host_neighbourhood")],as.numeric)
all[,c("neighbourhood_cleansed")] <- sapply(all[,c("neighbourhood_cleansed")],as.numeric)
which(is.na(all$neighbourhood_cleansed))
## integer(0)
cor(train$maximum_nights,train$price)
## [1] 0.0002808924
cor(train$minimum_nights,train$price)
## [1] 0.006085157
cor(train$maximum_nights_avg_ntm,train$price)
## [1] 0.02807403
ggplot(train,aes(minimum_nights_avg_ntm,price)) +
geom_point()
Hmisc::describe(all$first_review)
## all$first_review
## n missing distinct
## 51661 2 3253
##
## lowest : 2009-02-23 2009-03-12 2009-04-13 2009-04-20 2009-04-23
## highest: 2020-06-04 2020-06-05 2020-06-06 2020-06-07 2020-06-08
Hmisc::describe(all$last_review)
## all$last_review
## n missing distinct
## 51661 2 2093
##
## lowest : 2011-05-12 2011-08-29 2011-09-18 2011-09-19 2011-10-23
## highest: 2020-06-05 2020-06-06 2020-06-07 2020-06-08 2020-06-09
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
all$first_review <- ymd(all[,c("first_review")])
all$last_review <- ymd(all[,c("last_review")])
all<- all %>%
mutate(diff_day = as.double(difftime(all$last_review,all$first_review,units = "days")))
for (i in 1:nrow(all)){
if(is.na(all$diff_day[i])){
all$diff_day[i] <- median(all$diff_day,na.rm = TRUE)
}
}
amenities are very good feature for our model. Intuitively, the amenities may have correlation with price. But the question is, in what aspects?
My guess are the number of amenities, and the type of amenities.
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following object is masked from 'package:wrapr':
##
## :=
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
all <- all %>%
mutate(amenities = gsub("\\[", "",amenities)) %>% ## remove [
mutate(amenities = gsub("\\]", "",amenities)) %>% ## remove ]
mutate(amenities = gsub("\\'", "",amenities)) %>% ## remove '
mutate(amenities = gsub("\\.", "",amenities)) %>% ## remove '
mutate(amenities = str_replace_all(amenities,"\\s", "")) %>%
mutate(amenities = noquote(amenities)) %>%
mutate(amenities = tolower(amenities)) %>%
mutate(number_amenities = sapply(strsplit(amenities,','), uniqueN))
cor(all$number_amenities,all$price,use="complete.obs")
## [1] 0.1807215
library(qdapTools)
##
## Attaching package: 'qdapTools'
## The following object is masked from 'package:data.table':
##
## shift
## The following object is masked from 'package:dplyr':
##
## id
all = cbind(all,mtabulate(strsplit(as.character(all$amenities), ',')))
head(all$amenities)
## [1] tv,wifi,airconditioning,kitchen,heating,washer,dryer,smokedetector,carbonmonoxidedetector,fireextinguisher,essentials,hangers,hairdryer,selfcheck-in,keypad,privateentrance
## [2] tv,internet,wifi,kitchen,petsliveonthisproperty,dog(s),freestreetparking,buzzer/wirelessintercom,heating,family/kidfriendly,washer,dryer,smokedetector,carbonmonoxidedetector,fireextinguisher,hairdryer,iron,laptopfriendlyworkspace,translationmissing:enhosting_amenity_50,selfcheck-in,lockbox,privatelivingroom,hotwater,microwave,coffeemaker,refrigerator,dishwasher,dishesandsilverware,cookingbasics,oven,stove,patioorbalcony,gardenorbackyard,bakingsheet,trashcan
## [3] tv,wifi,airconditioning,kitchen,heating,washer,dryer,smokedetector,carbonmonoxidedetector,firstaidkit,fireextinguisher,essentials,shampoo,lockonbedroomdoor,hangers,hairdryer,iron,laptopfriendlyworkspace,selfcheck-in,smartlock,privatelivingroom,privateentrance,room-darkeningshades,hotwater,bedlinens,microwave,coffeemaker,refrigerator,dishwasher,dishesandsilverware,cookingbasics,oven,stove,gardenorbackyard
## [4] wifi,airconditioning,kitchen,buzzer/wirelessintercom,heating,smokedetector,carbonmonoxidedetector,essentials,shampoo,hangers,hairdryer,iron,hotwater,microwave,refrigerator,dishesandsilverware,cookingbasics,oven,stove,hostgreetsyou
## [5] wifi,airconditioning,kitchen,freestreetparking,heating,washer,dryer,smokedetector,carbonmonoxidedetector,firstaidkit,fireextinguisher,essentials,shampoo,lockonbedroomdoor,hangers,hairdryer,iron,laptopfriendlyworkspace,privatelivingroom,privateentrance,hotwater,microwave,coffeemaker,refrigerator,dishesandsilverware,cookingbasics,oven,stove
## [6] wifi,airconditioning,kitchen,heating,smokedetector,firstaidkit,essentials,hangers,hairdryer,iron,laptopfriendlyworkspace,privatelivingroom,coffeemaker
all <- all %>%
mutate_at(c("amenities"),as.character)
all1 <- all %>%
mutate(amenities = str_split(amenities, ",")) %>%
select(amenities)
words.freq<-table(unlist(all1));
words.freq<-sort(words.freq, decreasing=TRUE)
#names(words.freq)[71:153]
all = all[, !(colnames(all) %in% names(words.freq)[71:153])]
dim(all)
## [1] 51663 134
head(all)
## id host_since host_response_time host_response_rate host_acceptance_rate
## 1 914965 2019-06-13 within a few hours 70% N/A
## 2 262763 2011-06-08 within an hour 86% 25%
## 3 279053 2013-01-20 within a day 100% 66%
## 4 385319 2014-08-18 within an hour 100% 98%
## 5 476558 2018-09-14 within an hour 100% 100%
## 6 522365 2019-05-31 within a day 83% 90%
## host_is_superhost host_neighbourhood host_listings_count
## 1 0 -49.759367 0
## 2 0 1.451335 1
## 3 1 3.577991 1
## 4 0 72.433628 3
## 5 1 -55.268907 6
## 6 1 -35.597427 1
## host_total_listings_count
## 1 0
## 2 1
## 3 1
## 4 3
## 5 6
## 6 1
## host_verifications
## 1 ['phone']
## 2 ['email', 'phone', 'facebook', 'reviews', 'jumio', 'government_id', 'work_email']
## 3 ['email', 'phone', 'reviews', 'jumio', 'government_id']
## 4 ['email', 'phone', 'reviews', 'jumio', 'government_id']
## 5 ['email', 'phone']
## 6 ['phone', 'jumio', 'offline_government_id', 'selfie', 'government_id', 'identity_manual']
## host_identity_verified neighbourhood neighbourhood_cleansed
## 1 0 Brooklyn -51.220420
## 2 1 Brooklyn 3.873703
## 3 0 Prospect Heights 3.873703
## 4 1 Manhattan 71.085886
## 5 0 Sunnyside -58.480477
## 6 0 Queens -38.990225
## neighbourhood_group_cleansed zipcode is_location_exact property_type
## 1 Brooklyn 11207 1 Condominium
## 2 Brooklyn 11238 0 Apartment
## 3 Brooklyn 11238 1 Guest suite
## 4 Manhattan 10018 1 Apartment
## 5 Queens 11104 1 House
## 6 Queens 11103 1 Apartment
## room_type accommodates bathrooms bedrooms beds bed_type
## 1 Entire home/apt 5 1.5 2 2 Real Bed
## 2 Entire home/apt 4 1.0 2 1 Real Bed
## 3 Entire home/apt 3 1.0 1 1 Real Bed
## 4 Private room 2 1.0 1 1 Real Bed
## 5 Private room 2 1.5 1 1 Real Bed
## 6 Private room 2 1.0 1 1 Real Bed
## amenities
## 1 tv,wifi,airconditioning,kitchen,heating,washer,dryer,smokedetector,carbonmonoxidedetector,fireextinguisher,essentials,hangers,hairdryer,selfcheck-in,keypad,privateentrance
## 2 tv,internet,wifi,kitchen,petsliveonthisproperty,dog(s),freestreetparking,buzzer/wirelessintercom,heating,family/kidfriendly,washer,dryer,smokedetector,carbonmonoxidedetector,fireextinguisher,hairdryer,iron,laptopfriendlyworkspace,translationmissing:enhosting_amenity_50,selfcheck-in,lockbox,privatelivingroom,hotwater,microwave,coffeemaker,refrigerator,dishwasher,dishesandsilverware,cookingbasics,oven,stove,patioorbalcony,gardenorbackyard,bakingsheet,trashcan
## 3 tv,wifi,airconditioning,kitchen,heating,washer,dryer,smokedetector,carbonmonoxidedetector,firstaidkit,fireextinguisher,essentials,shampoo,lockonbedroomdoor,hangers,hairdryer,iron,laptopfriendlyworkspace,selfcheck-in,smartlock,privatelivingroom,privateentrance,room-darkeningshades,hotwater,bedlinens,microwave,coffeemaker,refrigerator,dishwasher,dishesandsilverware,cookingbasics,oven,stove,gardenorbackyard
## 4 wifi,airconditioning,kitchen,buzzer/wirelessintercom,heating,smokedetector,carbonmonoxidedetector,essentials,shampoo,hangers,hairdryer,iron,hotwater,microwave,refrigerator,dishesandsilverware,cookingbasics,oven,stove,hostgreetsyou
## 5 wifi,airconditioning,kitchen,freestreetparking,heating,washer,dryer,smokedetector,carbonmonoxidedetector,firstaidkit,fireextinguisher,essentials,shampoo,lockonbedroomdoor,hangers,hairdryer,iron,laptopfriendlyworkspace,privatelivingroom,privateentrance,hotwater,microwave,coffeemaker,refrigerator,dishesandsilverware,cookingbasics,oven,stove
## 6 wifi,airconditioning,kitchen,heating,smokedetector,firstaidkit,essentials,hangers,hairdryer,iron,laptopfriendlyworkspace,privatelivingroom,coffeemaker
## price cleaning_fee guests_included extra_people minimum_nights maximum_nights
## 1 145 110 1 0 6 29
## 2 199 50 2 50 2 1125
## 3 165 75 2 50 2 12
## 4 59 49 1 25 30 1125
## 5 48 25 1 15 2 1125
## 6 80 45 1 0 2 30
## minimum_minimum_nights maximum_minimum_nights minimum_maximum_nights
## 1 6 6 29
## 2 2 2 1125
## 3 1 3 12
## 4 30 30 1125
## 5 2 2 1125
## 6 2 2 30
## maximum_maximum_nights minimum_nights_avg_ntm maximum_nights_avg_ntm
## 1 29 6 29
## 2 1125 2 1125
## 3 12 2 12
## 4 1125 30 1125
## 5 1125 2 1125
## 6 30 2 30
## calendar_updated availability_30 availability_60 availability_90
## 1 2 weeks ago 0 0 0
## 2 3 weeks ago 8 38 68
## 3 yesterday 17 28 48
## 4 4 weeks ago 11 11 11
## 5 2 months ago 0 0 12
## 6 4 weeks ago 5 35 65
## availability_365 number_of_reviews number_of_reviews_ltm first_review
## 1 18 2 2 2019-07-15
## 2 158 5 4 2017-04-11
## 3 112 52 45 2019-01-21
## 4 35 30 29 2018-05-19
## 5 73 14 14 2019-05-07
## 6 65 20 20 2019-07-01
## last_review review_scores_rating review_scores_accuracy
## 1 2019-07-20 10.0 10
## 2 2019-12-31 9.2 10
## 3 2020-03-01 10.0 10
## 4 2020-01-01 9.4 9
## 5 2020-01-01 9.9 10
## 6 2020-01-01 10.0 10
## review_scores_cleanliness review_scores_checkin review_scores_communication
## 1 10 10 10
## 2 10 10 10
## 3 10 10 10
## 4 9 9 10
## 5 9 10 10
## 6 10 10 10
## review_scores_location review_scores_value instant_bookable
## 1 10 10 0
## 2 10 8 0
## 3 10 10 0
## 4 10 9 0
## 5 10 10 1
## 6 10 10 1
## cancellation_policy require_guest_profile_picture
## 1 strict_14_with_grace_period f
## 2 strict_14_with_grace_period f
## 3 strict_14_with_grace_period f
## 4 strict_14_with_grace_period f
## 5 strict_14_with_grace_period f
## 6 flexible f
## require_guest_phone_verification calculated_host_listings_count
## 1 f 1
## 2 f 1
## 3 f 1
## 4 f 2
## 5 f 3
## 6 f 1
## calculated_host_listings_count_entire_homes
## 1 1
## 2 1
## 3 1
## 4 0
## 5 0
## 6 0
## calculated_host_listings_count_private_rooms
## 1 0
## 2 0
## 3 0
## 4 2
## 5 3
## 6 1
## calculated_host_listings_count_shared_rooms reviews_per_month zipcode_m
## 1 0 2.00 -59.044505
## 2 0 0.14 -2.906385
## 3 0 3.72 -1.465574
## 4 0 1.35 91.877609
## 5 0 1.34 -54.118285
## 6 0 2.33 -40.282820
## diff_day number_amenities 24-hourcheck-in airconditioning bathtub bbqgrill
## 1 5 16 0 1 0 0
## 2 994 35 0 0 0 0
## 3 405 34 0 1 0 0
## 4 592 20 0 1 0 0
## 5 239 28 0 1 0 0
## 6 184 13 0 1 0 0
## bedlinens breakfast buzzer/wirelessintercom cabletv carbonmonoxidedetector
## 1 0 0 0 0 1
## 2 0 0 1 0 1
## 3 1 0 0 0 1
## 4 0 0 1 0 1
## 5 0 0 0 0 1
## 6 0 0 0 0 0
## children’sbooksandtoys cleaningbeforecheckout coffeemaker cookingbasics
## 1 0 0 0 0
## 2 0 0 1 1
## 3 0 0 1 1
## 4 0 0 0 1
## 5 0 0 1 1
## 6 0 0 1 0
## dishesandsilverware dishwasher doorman dryer elevator essentials
## 1 0 0 0 1 0 1
## 2 1 1 0 1 0 0
## 3 1 1 0 1 0 1
## 4 1 0 0 0 0 1
## 5 1 0 0 1 0 1
## 6 0 0 0 0 0 1
## ethernetconnection extrapillowsandblankets family/kidfriendly
## 1 0 0 0
## 2 0 0 1
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## fireextinguisher firstaidkit freeparkingonpremises freestreetparking
## 1 1 0 0 0
## 2 1 0 0 1
## 3 1 1 0 0
## 4 0 0 0 0
## 5 1 1 0 1
## 6 0 1 0 0
## gardenorbackyard gym hairdryer hangers heating hostgreetsyou hotwater
## 1 0 0 1 1 1 0 0
## 2 1 0 1 0 1 0 1
## 3 1 0 1 1 1 0 1
## 4 0 0 1 1 1 1 1
## 5 0 0 1 1 1 0 1
## 6 0 0 1 1 1 0 0
## indoorfireplace internet iron keypad kitchen laptopfriendlyworkspace lockbox
## 1 0 0 0 1 1 0 0
## 2 0 1 1 0 1 1 1
## 3 0 0 1 0 1 1 0
## 4 0 0 1 0 1 0 0
## 5 0 0 1 0 1 1 0
## 6 0 0 1 0 1 1 0
## lockonbedroomdoor longtermstaysallowed luggagedropoffallowed microwave
## 1 0 0 0 0
## 2 0 0 0 1
## 3 1 0 0 1
## 4 0 0 0 1
## 5 1 0 0 1
## 6 0 0 0 0
## nostairsorstepstoenter other oven pack’nplay/travelcrib
## 1 0 0 0 0
## 2 0 0 1 0
## 3 0 0 1 0
## 4 0 0 1 0
## 5 0 0 1 0
## 6 0 0 0 0
## paidparkingoffpremises patioorbalcony petsallowed petsliveonthisproperty
## 1 0 0 0 0
## 2 0 1 0 1
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## privateentrance privatelivingroom refrigerator room-darkeningshades
## 1 1 0 0 0
## 2 0 1 1 0
## 3 1 1 1 1
## 4 0 0 1 0
## 5 1 1 1 0
## 6 0 1 0 0
## safetycard selfcheck-in shampoo showergel singlelevelhome smokedetector
## 1 0 1 0 0 0 1
## 2 0 1 0 0 0 1
## 3 0 1 1 0 0 1
## 4 0 0 1 0 0 1
## 5 0 0 1 0 0 1
## 6 0 0 0 0 0 1
## smokingallowed stove translationmissing:enhosting_amenity_49
## 1 0 0 0
## 2 0 1 0
## 3 0 1 0
## 4 0 1 0
## 5 0 1 0
## 6 0 0 0
## translationmissing:enhosting_amenity_50 trashcan tv washer wifi
## 1 0 0 1 1 1
## 2 1 1 1 1 1
## 3 0 0 1 1 1
## 4 0 0 0 0 1
## 5 0 0 0 1 1
## 6 0 0 0 0 1
all<- all %>%
mutate(total_room = bathrooms+bedrooms)
head(all$total_room)
## [1] 3.5 3.0 2.0 2.0 2.5 2.0
Read zip_code_database.csv, and join two tables to get longitude and latitude.
zipcode_info = read.csv("/Users/chensixian/Desktop/CU AA/5200/Kaggle/rentlala2021/zip_code_database.csv")
head(zipcode_info)
## zip type decommissioned primary_city acceptable_cities
## 1 82001 STANDARD 0 Cheyenne Fe Warren AFB
## 2 82002 UNIQUE 0 Cheyenne
## 3 82003 PO BOX 0 Cheyenne
## 4 82005 STANDARD 0 Fe Warren AFB Cheyenne
## 5 82006 UNIQUE 0 Cheyenne
## 6 82007 STANDARD 0 Cheyenne
## unacceptable_cities state county timezone area_codes
## 1 WY Laramie County America/Denver 307
## 2 State Of Wyoming WY Laramie County America/Denver 307
## 3 WY Laramie County America/Denver 307
## 4 WY Laramie County America/Denver 307
## 5 Wy State Game And Fish WY Laramie County America/Denver 307
## 6 WY Laramie County America/Denver 307
## world_region country latitude longitude irs_estimated_population
## 1 <NA> US 41.14 -104.79 32090
## 2 <NA> US 41.14 -104.79 25
## 3 <NA> US 41.14 -104.79 3208
## 4 <NA> US 41.14 -104.87 594
## 5 <NA> US 41.14 -104.79 0
## 6 <NA> US 41.08 -104.71 17660
Transform numerical zip code into string
zipcode_info$zip<- sapply(zipcode_info$zip,as.character)
head(zipcode_info)
## zip type decommissioned primary_city acceptable_cities
## 1 82001 STANDARD 0 Cheyenne Fe Warren AFB
## 2 82002 UNIQUE 0 Cheyenne
## 3 82003 PO BOX 0 Cheyenne
## 4 82005 STANDARD 0 Fe Warren AFB Cheyenne
## 5 82006 UNIQUE 0 Cheyenne
## 6 82007 STANDARD 0 Cheyenne
## unacceptable_cities state county timezone area_codes
## 1 WY Laramie County America/Denver 307
## 2 State Of Wyoming WY Laramie County America/Denver 307
## 3 WY Laramie County America/Denver 307
## 4 WY Laramie County America/Denver 307
## 5 Wy State Game And Fish WY Laramie County America/Denver 307
## 6 WY Laramie County America/Denver 307
## world_region country latitude longitude irs_estimated_population
## 1 <NA> US 41.14 -104.79 32090
## 2 <NA> US 41.14 -104.79 25
## 3 <NA> US 41.14 -104.79 3208
## 4 <NA> US 41.14 -104.87 594
## 5 <NA> US 41.14 -104.79 0
## 6 <NA> US 41.08 -104.71 17660
Use left join.
all<- left_join(x = all, y = zipcode_info[ , c("zip", "latitude","longitude")], by = c("zipcode"="zip"))
all$longitude <- abs(all$longitude)
ggplot(all,aes(longitude)) +
geom_histogram(bins=100)
ggplot(all[1:41330,],aes(longitude,latitude,color = price)) +
geom_point() +
ylim(40.5, 41) +
xlim(73, 74.5)
## Warning: Removed 1 rows containing missing values (geom_point).
coln <- c("host_since","host_response_rate","host_acceptance_rate","neighbourhood",
"host_verifications","neighbourhood","zipcode","amenities","minimum_minimum_nights","maximum_nights",
"maximum_minimum_nights","minimum_maximum_nights","maximum_maximum_nights","maximum_nights_avg_ntm","maximum_nights_avg_ntm","calendar_updated","first_review","last_review","require_guest_profile_picture",
"require_guest_phone_verification")
all1 <- all[,!names(all) %in% coln]
Split all data into train1 and test1, they have the same row numbers as the original dataset.
train1 <- all1[1:41330,]
test1 <- all1[is.na(all1$price),]
which(is.na(train1))
## integer(0)
One-hot encoding all factors.
library(vtreat)
trt = designTreatmentsZ(dframe = train1,
varlist = names(all1))
## [1] "vtreat 1.6.3 inspecting inputs Mon Nov 29 11:05:24 2021"
## [1] "designing treatments Mon Nov 29 11:05:24 2021"
## [1] " have initial level statistics Mon Nov 29 11:05:24 2021"
## [1] " scoring treatments Mon Nov 29 11:05:26 2021"
## [1] "have treatment plan Mon Nov 29 11:05:26 2021"
newvars = trt$scoreFrame[trt$scoreFrame$code%in% c('clean','lev'),'varName']
train1 = prepare(treatmentplan = trt,
dframe = train1,
varRestriction = newvars)
test1 = prepare(treatmentplan = trt,
dframe = test1,
varRestriction = newvars)
head(train1)
## id host_is_superhost host_neighbourhood host_listings_count
## 1 914965 0 -49.759367 0
## 2 262763 0 1.451335 1
## 3 279053 1 3.577991 1
## 4 385319 0 72.433628 3
## 5 476558 1 -55.268907 6
## 6 522365 1 -35.597427 1
## host_total_listings_count host_identity_verified neighbourhood_cleansed
## 1 0 0 -51.220420
## 2 1 1 3.873703
## 3 1 0 3.873703
## 4 3 1 71.085886
## 5 6 0 -58.480477
## 6 1 0 -38.990225
## is_location_exact accommodates bathrooms bedrooms beds price cleaning_fee
## 1 1 5 1.5 2 2 145 110
## 2 0 4 1.0 2 1 199 50
## 3 1 3 1.0 1 1 165 75
## 4 1 2 1.0 1 1 59 49
## 5 1 2 1.5 1 1 48 25
## 6 1 2 1.0 1 1 80 45
## guests_included extra_people minimum_nights minimum_nights_avg_ntm
## 1 1 0 6 6
## 2 2 50 2 2
## 3 2 50 2 2
## 4 1 25 30 30
## 5 1 15 2 2
## 6 1 0 2 2
## availability_30 availability_60 availability_90 availability_365
## 1 0 0 0 18
## 2 8 38 68 158
## 3 17 28 48 112
## 4 11 11 11 35
## 5 0 0 12 73
## 6 5 35 65 65
## number_of_reviews number_of_reviews_ltm review_scores_rating
## 1 2 2 10.0
## 2 5 4 9.2
## 3 52 45 10.0
## 4 30 29 9.4
## 5 14 14 9.9
## 6 20 20 10.0
## review_scores_accuracy review_scores_cleanliness review_scores_checkin
## 1 10 10 10
## 2 10 10 10
## 3 10 10 10
## 4 9 9 9
## 5 10 9 10
## 6 10 10 10
## review_scores_communication review_scores_location review_scores_value
## 1 10 10 10
## 2 10 10 8
## 3 10 10 10
## 4 10 10 9
## 5 10 10 10
## 6 10 10 10
## instant_bookable calculated_host_listings_count
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 2
## 5 1 3
## 6 1 1
## calculated_host_listings_count_entire_homes
## 1 1
## 2 1
## 3 1
## 4 0
## 5 0
## 6 0
## calculated_host_listings_count_private_rooms
## 1 0
## 2 0
## 3 0
## 4 2
## 5 3
## 6 1
## calculated_host_listings_count_shared_rooms reviews_per_month zipcode_m
## 1 0 2.00 -59.044505
## 2 0 0.14 -2.906385
## 3 0 3.72 -1.465574
## 4 0 1.35 91.877609
## 5 0 1.34 -54.118285
## 6 0 2.33 -40.282820
## diff_day number_amenities X24_minus_hourcheck_minus_in airconditioning
## 1 5 16 0 1
## 2 994 35 0 0
## 3 405 34 0 1
## 4 592 20 0 1
## 5 239 28 0 1
## 6 184 13 0 1
## bathtub bbqgrill bedlinens breakfast buzzer_slash_wirelessintercom cabletv
## 1 0 0 0 0 0 0
## 2 0 0 0 0 1 0
## 3 0 0 1 0 0 0
## 4 0 0 0 0 1 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## carbonmonoxidedetector children_sbooksandtoys cleaningbeforecheckout
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 0 0
## 6 0 0 0
## coffeemaker cookingbasics dishesandsilverware dishwasher doorman dryer
## 1 0 0 0 0 0 1
## 2 1 1 1 1 0 1
## 3 1 1 1 1 0 1
## 4 0 1 1 0 0 0
## 5 1 1 1 0 0 1
## 6 1 0 0 0 0 0
## elevator essentials ethernetconnection extrapillowsandblankets
## 1 0 1 0 0
## 2 0 0 0 0
## 3 0 1 0 0
## 4 0 1 0 0
## 5 0 1 0 0
## 6 0 1 0 0
## family_slash_kidfriendly fireextinguisher firstaidkit freeparkingonpremises
## 1 0 1 0 0
## 2 1 1 0 0
## 3 0 1 1 0
## 4 0 0 0 0
## 5 0 1 1 0
## 6 0 0 1 0
## freestreetparking gardenorbackyard gym hairdryer hangers heating
## 1 0 0 0 1 1 1
## 2 1 1 0 1 0 1
## 3 0 1 0 1 1 1
## 4 0 0 0 1 1 1
## 5 1 0 0 1 1 1
## 6 0 0 0 1 1 1
## hostgreetsyou hotwater indoorfireplace internet iron keypad kitchen
## 1 0 0 0 0 0 1 1
## 2 0 1 0 1 1 0 1
## 3 0 1 0 0 1 0 1
## 4 1 1 0 0 1 0 1
## 5 0 1 0 0 1 0 1
## 6 0 0 0 0 1 0 1
## laptopfriendlyworkspace lockbox lockonbedroomdoor longtermstaysallowed
## 1 0 0 0 0
## 2 1 1 0 0
## 3 1 0 1 0
## 4 0 0 0 0
## 5 1 0 1 0
## 6 1 0 0 0
## luggagedropoffallowed microwave nostairsorstepstoenter other oven
## 1 0 0 0 0 0
## 2 0 1 0 0 1
## 3 0 1 0 0 1
## 4 0 1 0 0 1
## 5 0 1 0 0 1
## 6 0 0 0 0 0
## pack_nplay_slash_travelcrib paidparkingoffpremises patioorbalcony petsallowed
## 1 0 0 0 0
## 2 0 0 1 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## petsliveonthisproperty privateentrance privatelivingroom refrigerator
## 1 0 1 0 0
## 2 1 0 1 1
## 3 0 1 1 1
## 4 0 0 0 1
## 5 0 1 1 1
## 6 0 0 1 0
## room_minus_darkeningshades safetycard selfcheck_minus_in shampoo showergel
## 1 0 0 1 0 0
## 2 0 0 1 0 0
## 3 1 0 1 1 0
## 4 0 0 0 1 0
## 5 0 0 0 1 0
## 6 0 0 0 0 0
## singlelevelhome smokedetector smokingallowed stove
## 1 0 1 0 0
## 2 0 1 0 1
## 3 0 1 0 1
## 4 0 1 0 1
## 5 0 1 0 1
## 6 0 1 0 0
## translationmissing_colon_enhosting_amenity_49
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## translationmissing_colon_enhosting_amenity_50 trashcan tv washer wifi
## 1 0 0 1 1 1
## 2 1 1 1 1 1
## 3 0 0 1 1 1
## 4 0 0 0 0 1
## 5 0 0 0 1 1
## 6 0 0 0 0 1
## total_room latitude longitude host_response_time_lev_x_a_few_days_or_more
## 1 3.5 40.67 73.89 0
## 2 3.0 40.68 73.96 0
## 3 2.0 40.68 73.96 0
## 4 2.0 40.76 73.99 0
## 5 2.5 40.74 73.92 0
## 6 2.0 40.76 73.91 0
## host_response_time_lev_x_unknown host_response_time_lev_x_within_a_day
## 1 0 0
## 2 0 0
## 3 0 1
## 4 0 0
## 5 0 0
## 6 0 1
## host_response_time_lev_x_within_a_few_hours
## 1 1
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## host_response_time_lev_x_within_an_hour
## 1 0
## 2 1
## 3 0
## 4 1
## 5 1
## 6 0
## neighbourhood_group_cleansed_lev_x_Bronx
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## neighbourhood_group_cleansed_lev_x_Brooklyn
## 1 1
## 2 1
## 3 1
## 4 0
## 5 0
## 6 0
## neighbourhood_group_cleansed_lev_x_Manhattan
## 1 0
## 2 0
## 3 0
## 4 1
## 5 0
## 6 0
## neighbourhood_group_cleansed_lev_x_Queens
## 1 0
## 2 0
## 3 0
## 4 0
## 5 1
## 6 1
## neighbourhood_group_cleansed_lev_x_Staten_Island
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## property_type_lev_x_Apartment property_type_lev_x_Boutique_hotel
## 1 0 0
## 2 1 0
## 3 0 0
## 4 1 0
## 5 0 0
## 6 1 0
## property_type_lev_x_Condominium property_type_lev_x_Guest_suite
## 1 1 0
## 2 0 0
## 3 0 1
## 4 0 0
## 5 0 0
## 6 0 0
## property_type_lev_x_Hotel property_type_lev_x_House property_type_lev_x_Loft
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 1 0
## 6 0 0 0
## property_type_lev_x_Other property_type_lev_x_Townhouse
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## room_type_lev_x_Entire_home_slash_apt room_type_lev_x_Hotel_room
## 1 1 0
## 2 1 0
## 3 1 0
## 4 0 0
## 5 0 0
## 6 0 0
## room_type_lev_x_Private_room room_type_lev_x_Shared_room
## 1 0 0
## 2 0 0
## 3 0 0
## 4 1 0
## 5 1 0
## 6 1 0
## bed_type_lev_x_Airbed bed_type_lev_x_Couch bed_type_lev_x_Futon
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## bed_type_lev_x_Pull_minus_out_Sofa bed_type_lev_x_Real_Bed
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 1
## 5 0 1
## 6 0 1
## cancellation_policy_lev_x_flexible cancellation_policy_lev_x_moderate
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 1 0
## cancellation_policy_lev_x_strict
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## cancellation_policy_lev_x_strict_14_with_grace_period
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 0
## cancellation_policy_lev_x_super_strict_30
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## cancellation_policy_lev_x_super_strict_60
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
label_train <- train1$price
dtrain <- xgb.DMatrix(data = as.matrix(train1[, !names(train1) %in% c("price")] ), label= label_train)
After parameter tunning, I find max_depth =7 gave me the lowest local RMSE. And I also set gamma as 2 to prevent overfitting.
default_param<-list(
objective = "reg:linear",
booster = "gbtree",
eta=0.05, #default = 0.3
gamma=2,
max_depth=7, #default=6
min_child_weight=4, #default=1
subsample=1,
colsample_bytree=1
)
I set nround to 1000, so that model can find the best iteration.
xgbcv <- xgb.cv( params = default_param, data = dtrain, nrounds = 1000, nfold = 5, showsd = T, stratified = T, print_every_n = 40, early_stopping_rounds = 10, maximize = F)
## [11:05:28] WARNING: amalgamation/../src/objective/regression_obj.cu:171: reg:linear is now deprecated in favor of reg:squarederror.
## [11:05:28] WARNING: amalgamation/../src/objective/regression_obj.cu:171: reg:linear is now deprecated in favor of reg:squarederror.
## [11:05:29] WARNING: amalgamation/../src/objective/regression_obj.cu:171: reg:linear is now deprecated in favor of reg:squarederror.
## [11:05:29] WARNING: amalgamation/../src/objective/regression_obj.cu:171: reg:linear is now deprecated in favor of reg:squarederror.
## [11:05:30] WARNING: amalgamation/../src/objective/regression_obj.cu:171: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:168.488867+0.721354 test-rmse:168.573300+2.922263
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [41] train-rmse:58.966072+0.246791 test-rmse:66.362495+1.830743
## [81] train-rmse:48.706135+0.145505 test-rmse:60.748808+1.615544
## [121] train-rmse:45.214314+0.119480 test-rmse:59.761356+1.525036
## [161] train-rmse:43.053410+0.183099 test-rmse:59.305290+1.483833
## [201] train-rmse:41.374628+0.140586 test-rmse:59.032665+1.453495
## [241] train-rmse:39.953196+0.125722 test-rmse:58.873120+1.449741
## [281] train-rmse:38.659590+0.170006 test-rmse:58.739704+1.467543
## [321] train-rmse:37.375632+0.234299 test-rmse:58.642889+1.479525
## [361] train-rmse:36.270510+0.227390 test-rmse:58.581929+1.486024
## [401] train-rmse:35.284778+0.265717 test-rmse:58.521047+1.467086
## [441] train-rmse:34.259458+0.307754 test-rmse:58.471801+1.451686
## Stopping. Best iteration:
## [443] train-rmse:34.213709+0.314700 test-rmse:58.469186+1.449435
[424] train-rmse:35.121716+0.256536 test-rmse:58.969470+1.625035 [687] train-rmse:35.260908+0.088415 test-rmse:58.728155+1.636479 [771] train-rmse:34.294465+0.081300 test-rmse:58.859532+1.554717 [467] train-rmse:33.863055+0.325591 test-rmse:58.503620+1.433310
xgb_mod <- xgb.train(data = dtrain, params=default_param, nrounds = 467)
## [11:16:14] WARNING: amalgamation/../src/objective/regression_obj.cu:171: reg:linear is now deprecated in favor of reg:squarederror.
#XGBpred <- predict(xgb_mod, dtest)
#head(XGBpred)
dtrain <- xgb.DMatrix(data = as.matrix(train1[, !names(train1) %in% c("price")] ))
XGBpred_t <- predict(xgb_mod, dtrain)
head(XGBpred_t)
## [1] 166.94334 207.57115 157.97151 86.39487 57.63774 77.51135
dtest <- xgb.DMatrix(data = as.matrix(test1[, !names(test1) %in% c("price")] ))
XGBpred <- predict(xgb_mod, dtest)
head(XGBpred)
## [1] 143.81493 52.97718 90.75443 197.08588 244.56882 129.34155
Although some would argue Xgboost model has low interpretability, the plot shows some of the most important features in the Xgboost model.
library(Ckmeans.1d.dp) #required for ggplot clustering
mat <- xgb.importance (feature_names = colnames(dtest),model = xgb_mod)
xgb.ggplot.importance(importance_matrix = mat[1:30], rel_to_first = TRUE)
library(tidyverse)
my_submission <- data_frame('id' = ScoringData$id, 'price' = XGBpred)
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
my_submission
## # A tibble: 10,333 x 2
## id price
## <int> <dbl>
## 1 10310 144.
## 2 10426 53.0
## 3 10583 90.8
## 4 10973 197.
## 5 11000 245.
## 6 11161 129.
## 7 11262 97.6
## 8 11561 97.2
## 9 11589 280.
## 10 11625 270.
## # … with 10,323 more rows
# save our file
write_csv(my_submission, 'submission_xgboost_v11_logtransform_-999_bin_1.csv')